Sharing data and models in software engineering

Data Science for Software Engineering: Sharing Data and Models presents guidance and procedures for reusing data and models between projects to produce results that are useful and relevant. Starting with a background section of practical lessons and warnings for beginner data scientists for software...


Bibliographic Details
Other Authors: Menzies, Tim (author); Rogers, Mark (designer)
Format: Electronic book
Language: English
Published: Waltham, Massachusetts: Morgan Kaufmann, 2015.
Edition: First edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629837106719
Table of Contents:
  • Front Cover
  • Sharing Data and Models in Software Engineering
  • Copyright
  • Why this book?
  • Foreword
  • Contents
  • List of Figures
  • Chapter 1: Introduction
  • 1.1 Why Read This Book?
  • 1.2 What Do We Mean by "Sharing"?
  • 1.2.1 Sharing Insights
  • 1.2.2 Sharing Models
  • 1.2.3 Sharing Data
  • 1.2.4 Sharing Analysis Methods
  • 1.2.5 Types of Sharing
  • 1.2.6 Challenges with Sharing
  • 1.2.7 How to Share
  • 1.3 What? (Our Executive Summary)
  • 1.3.1 An Overview
  • 1.3.2 More Details
  • 1.4 How to Read This Book
  • 1.4.1 Data Analysis Patterns
  • 1.5 But What About …? (What Is Not in This Book)
  • 1.5.1 What About "Big Data"?
  • 1.5.2 What About Related Work?
  • 1.5.3 Why All the Defect Prediction and Effort Estimation?
  • 1.6 Who? (About the Authors)
  • 1.7 Who Else? (Acknowledgments)
  • Part I: Data Mining for Managers
  • Chapter 2: Rules for Managers
  • 2.1 The Inductive Engineering Manifesto
  • 2.2 More Rules
  • Chapter 3: Rule #1: Talk to the Users
  • 3.1 User Biases
  • 3.2 Data Mining Biases
  • 3.3 Can We Avoid Bias?
  • 3.4 Managing Biases
  • 3.5 Summary
  • Chapter 4: Rule #2: Know the Domain
  • 4.1 Cautionary Tale #1: "Discovering" Random Noise
  • 4.2 Cautionary Tale #2: Jumping at Shadows
  • 4.3 Cautionary Tale #3: It Pays to Ask
  • 4.4 Summary
  • Chapter 5: Rule #3: Suspect Your Data
  • 5.1 Controlling Data Collection
  • 5.2 Problems with Controlled Data Collection
  • 5.3 Rinse (and Prune) Before Use
  • 5.3.1 Row Pruning
  • 5.3.2 Column Pruning
  • 5.4 On the Value of Pruning
  • 5.5 Summary
  • Chapter 6: Rule #4: Data Science Is Cyclic
  • 6.1 The Knowledge Discovery Cycle
  • 6.2 Evolving Cyclic Development
  • 6.2.1 Scouting
  • 6.2.2 Surveying
  • 6.2.3 Building
  • 6.2.4 Effort
  • 6.3 Summary
  • Part II: Data Mining: A Technical Tutorial
  • Chapter 7: Data Mining and SE
  • 7.1 Some Definitions
  • 7.2 Some Application Areas
  • Chapter 8: Defect Prediction
  • 8.1 Defect Detection Economics
  • 8.2 Static Code Defect Prediction
  • 8.2.1 Easy to Use
  • 8.2.2 Widely Used
  • 8.2.3 Useful
  • Chapter 9: Effort Estimation
  • 9.1 The Estimation Problem
  • 9.2 How to Make Estimates
  • 9.2.1 Expert-Based Estimation
  • 9.2.2 Model-Based Estimation
  • 9.2.3 Hybrid Methods
  • Chapter 10: Data Mining (Under the Hood)
  • 10.1 Data Carving
  • 10.2 About the Data
  • 10.3 Cohen Pruning
  • 10.4 Discretization
  • 10.4.1 Other Discretization Methods
  • 10.5 Column Pruning
  • 10.6 Row Pruning
  • 10.7 Cluster Pruning
  • 10.7.1 Advantages of Prototypes
  • 10.7.2 Advantages of Clustering
  • 10.8 Contrast Pruning
  • 10.9 Goal Pruning
  • 10.10 Extensions for Continuous Classes
  • 10.10.1 How RTs Work
  • 10.10.2 Creating Splits for Categorical Input Features
  • 10.10.3 Splits on Numeric Input Features
  • 10.10.4 Termination Condition and Predictions
  • 10.10.5 Potential Advantages of RTs for Software Effort Estimation
  • 10.10.6 Predictions for Multiple Numeric Goals
  • Part III: Sharing Data
  • Chapter 11: Sharing Data: Challenges and Methods
  • 11.1 Houston, We Have a Problem
  • 11.2 Good News, Everyone
  • Chapter 12: Learning Contexts
  • 12.1 Background
  • 12.2 Manual Methods for Contextualization
  • 12.3 Automatic Methods
  • 12.4 Other Motivation to Find Contexts
  • 12.4.1 Variance Reduction
  • 12.4.2 Anomaly Detection
  • 12.4.3 Certification Envelopes
  • 12.4.4 Incremental Learning
  • 12.4.5 Compression
  • 12.4.6 Optimization
  • 12.5 How to Find Local Regions
  • 12.5.1 License
  • 12.5.2 Installing CHUNK
  • 12.5.3 Testing Your Installation
  • 12.5.4 Applying CHUNK to Other Models
  • 12.6 Inside CHUNK
  • 12.6.1 Roadmap to Functions
  • 12.6.2 Distance Calculations
  • 12.6.2.1 Normalize
  • 12.6.2.2 SquaredDifference
  • 12.6.3 Dividing the Data
  • 12.6.3.1 FastDiv
  • 12.6.3.2 TwoDistantPoints
  • 12.6.3.3 Settings
  • 12.6.3.4 Chunk (main function)
  • 12.6.4 Support Utilities
  • 12.6.4.1 Some standard tricks
  • 12.6.4.2 Tree iterators
  • 12.6.4.3 Pretty printing
  • 12.7 Putting It all Together
  • 12.7.1 _nasa93
  • 12.8 Using CHUNK
  • 12.9 Closing Remarks
  • Chapter 13: Cross-Company Learning: Handling the Data Drought
  • 13.1 Motivation
  • 13.2 Setting the Ground for Analyses
  • 13.2.1 Wait … Is This Really CC Data?
  • 13.2.2 Mining the Data
  • 13.2.3 Magic Trick: NN Relevancy Filtering
  • 13.3 Analysis #1: Can CC Data be Useful for an Organization?
  • 13.3.1 Design
  • 13.3.2 Results from Analysis #1
  • 13.3.3 Checking the Analysis #1 Results
  • 13.3.4 Discussion of Analysis #1
  • 13.4 Analysis #2: How to Clean Up CC Data for Local Tuning?
  • 13.4.1 Design
  • 13.4.2 Results
  • 13.4.3 Discussions
  • 13.5 Analysis #3: How Much Local Data Does an Organization Need for a Local Model?
  • 13.5.1 Design
  • 13.5.2 Results from Analysis #3
  • 13.5.3 Checking the Analysis #3 Results
  • 13.5.4 Discussion of Analysis #3
  • 13.6 How Trustworthy Are These Results?
  • 13.7 Are These Useful in Practice or Just Number Crunching?
  • 13.8 What's New on Cross-Learning?
  • 13.8.1 Discussion
  • 13.9 What's the Takeaway?
  • Chapter 14: Building Smarter Transfer Learners
  • 14.1 What Is Actually the Problem?
  • 14.2 What Do We Know So Far?
  • 14.2.1 Transfer Learning
  • 14.2.2 Transfer Learning and SE
  • 14.2.3 Data Set Shift
  • 14.3 An Example Technology: TEAK
  • 14.4 The Details of the Experiments
  • 14.4.1 Performance Comparison
  • 14.4.2 Performance Measures
  • 14.4.3 Retrieval Tendency
  • 14.5 Results
  • 14.5.1 Performance Comparison
  • 14.5.2 Inspecting Selection Tendencies
  • 14.6 Discussion
  • 14.7 What Are the Takeaways?
  • Chapter 15: Sharing Less Data (Is a Good Thing)
  • 15.1 Can We Share Less Data?
  • 15.2 Using Less Data
  • 15.3 Why Share Less Data?
  • 15.3.1 Less Data Is More Reliable
  • 15.3.2 Less Data Is Faster to Discuss
  • 15.3.3 Less Data Is Easier to Process
  • 15.4 How to Find Less Data
  • 15.4.1 Input
  • 15.4.2 Comparisons to Other Learners
  • 15.4.3 Reporting the Results
  • 15.4.4 Discussion of Results
  • 15.5 What's Next?
  • Chapter 16: How to Keep Your Data Private
  • 16.1 Motivation
  • 16.2 What Is PPDP and Why Is It Important?
  • 16.3 What Is Considered a Breach of Privacy?
  • 16.4 How to Avoid Privacy Breaches?
  • 16.4.1 Generalization and Suppression
  • 16.4.2 Anatomization and Permutation
  • 16.4.3 Perturbation
  • 16.4.4 Output Perturbation
  • 16.5 How Are Privacy-Preserving Algorithms Evaluated?
  • 16.5.1 Privacy Metrics
  • 16.5.2 Modeling the Background Knowledge of an Attacker
  • 16.6 Case Study: Privacy and Cross-Company Defect Prediction
  • 16.6.1 Results and Contributions
  • 16.6.2 Privacy and CCDP
  • 16.6.3 CLIFF
  • 16.6.4 MORPH
  • 16.6.5 Example of CLIFF & MORPH
  • 16.6.6 Evaluation Metrics
  • 16.6.7 Evaluating Utility via Classification
  • 16.6.8 Evaluating Privatization
  • 16.6.8.1 Defining privacy
  • 16.6.9 Experiments
  • 16.6.9.1 Data
  • 16.6.10 Design
  • 16.6.11 Defect Predictors
  • 16.6.12 Query Generator
  • 16.6.13 Benchmark Privacy Algorithms
  • 16.6.14 Experimental Evaluation
  • 16.6.15 Discussion
  • 16.6.16 Related Work: Privacy in SE
  • 16.6.17 Summary
  • Chapter 17: Compensating for Missing Data
  • 17.1 Background Notes on SEE and Instance Selection
  • 17.1.1 Software Effort Estimation
  • 17.1.2 Instance Selection in SEE
  • 17.2 Data Sets and Performance Measures
  • 17.2.1 Data Sets
  • 17.2.2 Error Measures
  • 17.3 Experimental Conditions
  • 17.3.1 The Algorithms Adopted
  • 17.3.2 Proposed Method: POP1
  • 17.3.3 Experiments
  • 17.4 Results
  • 17.4.1 Results Without Instance Selection
  • 17.4.2 Results with Instance Selection
  • 17.5 Summary
  • Chapter 18: Active Learning: Learning More with Less
  • 18.1 How Does the QUICK Algorithm Work?
  • 18.1.1 Getting Rid of Similar Features: Synonym Pruning
  • 18.1.2 Getting Rid of Dissimilar Instances: Outlier Pruning
  • 18.2 Notes on Active Learning
  • 18.3 The Application and Implementation Details of QUICK
  • 18.3.1 Phase 1: Synonym Pruning
  • 18.3.2 Phase 2: Outlier Removal and Estimation
  • 18.3.3 Seeing QUICK in Action with a Toy Example
  • 18.3.3.1 Phase 1: Synonym pruning
  • 18.3.3.2 Phase 2: Outlier removal and estimation
  • 18.4 How the Experiments Are Designed
  • 18.5 Results
  • 18.5.1 Performance
  • 18.5.2 Reduction via Synonym and Outlier Pruning
  • 18.5.3 Comparison of QUICK vs. CART
  • 18.5.4 Detailed Look at the Statistical Analysis
  • 18.5.5 Early Results on Defect Data Sets
  • 18.6 Summary
  • Part IV: Sharing Models
  • Chapter 19: Sharing Models: Challenges and Methods
  • Chapter 20: Ensembles of Learning Machines
  • 20.1 When and Why Ensembles Work
  • 20.1.1 Intuition
  • 20.1.2 Theoretical Foundation
  • 20.2 Bootstrap Aggregating (Bagging)
  • 20.2.1 How Bagging Works
  • 20.2.2 When and Why Bagging Works
  • 20.2.3 Potential Advantages of Bagging for SEE
  • 20.3 Regression Trees (RTs) for Bagging
  • 20.4 Evaluation Framework
  • 20.4.1 Choice of Data Sets and Preprocessing Techniques
  • 20.4.1.1 PROMISE data
  • 20.4.1.2 ISBSG data
  • 20.4.2 Choice of Learning Machines
  • 20.4.3 Choice of Evaluation Methods
  • 20.4.4 Choice of Parameters
  • 20.5 Evaluation of Bagging+RTs in SEE
  • 20.5.1 Friedman Ranking
  • 20.5.2 Approaches Most Often Ranked First or Second in Terms of MAE, MMRE and PRED(25)
  • 20.5.3 Magnitude of Performance Against the Best
  • 20.5.4 Discussion
  • 20.6 Further Understanding of Bagging+RTs in SEE
  • 20.7 Summary
  • Chapter 21: How to Adapt Models in a Dynamic World
  • 21.1 Cross-Company Data and Questions Tackled
  • 21.2 Related Work