Sharing Data and Models in Software Engineering
Data Science for Software Engineering: Sharing Data and Models presents guidance and procedures for reusing data and models between projects to produce results that are useful and relevant. Starting with a background section of practical lessons and warnings for beginner data scientists in software engineering…
Other Authors:
Format: Ebook
Language: English
Published: Waltham, Massachusetts: Morgan Kaufmann, 2015
Edition: First edition
Subjects:
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629837106719
Table of Contents:
- Front Cover
- Sharing Data and Models in Software Engineering
- Copyright
- Why this book?
- Foreword
- Contents
- List of Figures
- Chapter 1: Introduction
- 1.1 Why Read This Book?
- 1.2 What Do We Mean by "Sharing"?
- 1.2.1 Sharing Insights
- 1.2.2 Sharing Models
- 1.2.3 Sharing Data
- 1.2.4 Sharing Analysis Methods
- 1.2.5 Types of Sharing
- 1.2.6 Challenges with Sharing
- 1.2.7 How to Share
- 1.3 What? (Our Executive Summary)
- 1.3.1 An Overview
- 1.3.2 More Details
- 1.4 How to Read This Book
- 1.4.1 Data Analysis Patterns
- 1.5 But What About …? (What Is Not in This Book)
- 1.5.1 What About "Big Data"?
- 1.5.2 What About Related Work?
- 1.5.3 Why All the Defect Prediction and Effort Estimation?
- 1.6 Who? (About the Authors)
- 1.7 Who Else? (Acknowledgments)
- Part I: Data Mining for Managers
- Chapter 2: Rules for Managers
- 2.1 The Inductive Engineering Manifesto
- 2.2 More Rules
- Chapter 3: Rule #1: Talk to the Users
- 3.1 User Biases
- 3.2 Data Mining Biases
- 3.3 Can We Avoid Bias?
- 3.4 Managing Biases
- 3.5 Summary
- Chapter 4: Rule #2: Know the Domain
- 4.1 Cautionary Tale #1: "Discovering" Random Noise
- 4.2 Cautionary Tale #2: Jumping at Shadows
- 4.3 Cautionary Tale #3: It Pays to Ask
- 4.4 Summary
- Chapter 5: Rule #3: Suspect Your Data
- 5.1 Controlling Data Collection
- 5.2 Problems with Controlled Data Collection
- 5.3 Rinse (and Prune) Before Use
- 5.3.1 Row Pruning
- 5.3.2 Column Pruning
- 5.4 On the Value of Pruning
- 5.5 Summary
- Chapter 6: Rule #4: Data Science Is Cyclic
- 6.1 The Knowledge Discovery Cycle
- 6.2 Evolving Cyclic Development
- 6.2.1 Scouting
- 6.2.2 Surveying
- 6.2.3 Building
- 6.2.4 Effort
- 6.3 Summary
- Part II: Data Mining: A Technical Tutorial
- Chapter 7: Data Mining and SE
- 7.1 Some Definitions
- 7.2 Some Application Areas
- Chapter 8: Defect Prediction
- 8.1 Defect Detection Economics
- 8.2 Static Code Defect Prediction
- 8.2.1 Easy to Use
- 8.2.2 Widely Used
- 8.2.3 Useful
- Chapter 9: Effort Estimation
- 9.1 The Estimation Problem
- 9.2 How to Make Estimates
- 9.2.1 Expert-Based Estimation
- 9.2.2 Model-Based Estimation
- 9.2.3 Hybrid Methods
- Chapter 10: Data Mining (Under the Hood)
- 10.1 Data Carving
- 10.2 About the Data
- 10.3 Cohen Pruning
- 10.4 Discretization
- 10.4.1 Other Discretization Methods
- 10.5 Column Pruning
- 10.6 Row Pruning
- 10.7 Cluster Pruning
- 10.7.1 Advantages of Prototypes
- 10.7.2 Advantages of Clustering
- 10.8 Contrast Pruning
- 10.9 Goal Pruning
- 10.10 Extensions for Continuous Classes
- 10.10.1 How RTs Work
- 10.10.2 Creating Splits for Categorical Input Features
- 10.10.3 Splits on Numeric Input Features
- 10.10.4 Termination Condition and Predictions
- 10.10.5 Potential Advantages of RTs for Software Effort Estimation
- 10.10.6 Predictions for Multiple Numeric Goals
- Part III: Sharing Data
- Chapter 11: Sharing Data: Challenges and Methods
- 11.1 Houston, We Have a Problem
- 11.2 Good News, Everyone
- Chapter 12: Learning Contexts
- 12.1 Background
- 12.2 Manual Methods for Contextualization
- 12.3 Automatic Methods
- 12.4 Other Motivation to Find Contexts
- 12.4.1 Variance Reduction
- 12.4.2 Anomaly Detection
- 12.4.3 Certification Envelopes
- 12.4.4 Incremental Learning
- 12.4.5 Compression
- 12.4.6 Optimization
- 12.5 How to Find Local Regions
- 12.5.1 License
- 12.5.2 Installing CHUNK
- 12.5.3 Testing Your Installation
- 12.5.4 Applying CHUNK to Other Models
- 12.6 Inside CHUNK
- 12.6.1 Roadmap to Functions
- 12.6.2 Distance Calculations
- 12.6.2.1 Normalize
- 12.6.2.2 SquaredDifference
- 12.6.3 Dividing the Data
- 12.6.3.1 FastDiv
- 12.6.3.2 TwoDistantPoints
- 12.6.3.3 Settings
- 12.6.3.4 Chunk (main function)
- 12.6.4 Support Utilities
- 12.6.4.1 Some standard tricks
- 12.6.4.2 Tree iterators
- 12.6.4.3 Pretty printing
- 12.7 Putting It all Together
- 12.7.1 _nasa93
- 12.8 Using CHUNK
- 12.9 Closing Remarks
- Chapter 13: Cross-Company Learning: Handling the Data Drought
- 13.1 Motivation
- 13.2 Setting the Ground for Analyses
- 13.2.1 Wait … Is This Really CC Data?
- 13.2.2 Mining the Data
- 13.2.3 Magic Trick: NN Relevancy Filtering
- 13.3 Analysis #1: Can CC Data be Useful for an Organization?
- 13.3.1 Design
- 13.3.2 Results from Analysis #1
- 13.3.3 Checking the Analysis #1 Results
- 13.3.4 Discussion of Analysis #1
- 13.4 Analysis #2: How to Clean Up CC Data for Local Tuning?
- 13.4.1 Design
- 13.4.2 Results
- 13.4.3 Discussions
- 13.5 Analysis #3: How Much Local Data Does an Organization Need for a Local Model?
- 13.5.1 Design
- 13.5.2 Results from Analysis #3
- 13.5.3 Checking the Analysis #3 Results
- 13.5.4 Discussion of Analysis #3
- 13.6 How Trustworthy Are These Results?
- 13.7 Are These Useful in Practice or Just Number Crunching?
- 13.8 What's New on Cross-Learning?
- 13.8.1 Discussion
- 13.9 What's the Takeaway?
- Chapter 14: Building Smarter Transfer Learners
- 14.1 What Is Actually the Problem?
- 14.2 What Do We Know So Far?
- 14.2.1 Transfer Learning
- 14.2.2 Transfer Learning and SE
- 14.2.3 Data Set Shift
- 14.3 An Example Technology: TEAK
- 14.4 The Details of the Experiments
- 14.4.1 Performance Comparison
- 14.4.2 Performance Measures
- 14.4.3 Retrieval Tendency
- 14.5 Results
- 14.5.1 Performance Comparison
- 14.5.2 Inspecting Selection Tendencies
- 14.6 Discussion
- 14.7 What Are the Takeaways?
- Chapter 15: Sharing Less Data (Is a Good Thing)
- 15.1 Can We Share Less Data?
- 15.2 Using Less Data
- 15.3 Why Share Less Data?
- 15.3.1 Less Data Is More Reliable
- 15.3.2 Less Data Is Faster to Discuss
- 15.3.3 Less Data Is Easier to Process
- 15.4 How to Find Less Data
- 15.4.1 Input
- 15.4.2 Comparisons to Other Learners
- 15.4.3 Reporting the Results
- 15.4.4 Discussion of Results
- 15.5 What's Next?
- Chapter 16: How to Keep Your Data Private
- 16.1 Motivation
- 16.2 What Is PPDP and Why Is It Important?
- 16.3 What Is Considered a Breach of Privacy?
- 16.4 How to Avoid Privacy Breaches?
- 16.4.1 Generalization and Suppression
- 16.4.2 Anatomization and Permutation
- 16.4.3 Perturbation
- 16.4.4 Output Perturbation
- 16.5 How Are Privacy-Preserving Algorithms Evaluated?
- 16.5.1 Privacy Metrics
- 16.5.2 Modeling the Background Knowledge of an Attacker
- 16.6 Case Study: Privacy and Cross-Company Defect Prediction
- 16.6.1 Results and Contributions
- 16.6.2 Privacy and CCDP
- 16.6.3 CLIFF
- 16.6.4 MORPH
- 16.6.5 Example of CLIFF&MORPH
- 16.6.6 Evaluation Metrics
- 16.6.7 Evaluating Utility via Classification
- 16.6.8 Evaluating Privatization
- 16.6.8.1 Defining privacy
- 16.6.9 Experiments
- 16.6.9.1 Data
- 16.6.10 Design
- 16.6.11 Defect Predictors
- 16.6.12 Query Generator
- 16.6.13 Benchmark Privacy Algorithms
- 16.6.14 Experimental Evaluation
- 16.6.15 Discussion
- 16.6.16 Related Work: Privacy in SE
- 16.6.17 Summary
- Chapter 17: Compensating for Missing Data
- 17.1 Background Notes on SEE and Instance Selection
- 17.1.1 Software Effort Estimation
- 17.1.2 Instance Selection in SEE
- 17.2 Data Sets and Performance Measures
- 17.2.1 Data Sets
- 17.2.2 Error Measures
- 17.3 Experimental Conditions
- 17.3.1 The Algorithms Adopted
- 17.3.2 Proposed Method: POP1
- 17.3.3 Experiments
- 17.4 Results
- 17.4.1 Results Without Instance Selection
- 17.4.2 Results with Instance Selection
- 17.5 Summary
- Chapter 18: Active Learning: Learning More with Less
- 18.1 How Does the QUICK Algorithm Work?
- 18.1.1 Getting Rid of Similar Features: Synonym Pruning
- 18.1.2 Getting Rid of Dissimilar Instances: Outlier Pruning
- 18.2 Notes on Active Learning
- 18.3 The Application and Implementation Details of QUICK
- 18.3.1 Phase 1: Synonym Pruning
- 18.3.2 Phase 2: Outlier Removal and Estimation
- 18.3.3 Seeing QUICK in Action with a Toy Example
- 18.3.3.1 Phase 1: Synonym pruning
- 18.3.3.2 Phase 2: Outlier removal and estimation
- 18.4 How the Experiments Are Designed
- 18.5 Results
- 18.5.1 Performance
- 18.5.2 Reduction via Synonym and Outlier Pruning
- 18.5.3 Comparison of QUICK vs. CART
- 18.5.4 Detailed Look at the Statistical Analysis
- 18.5.5 Early Results on Defect Data Sets
- 18.6 Summary
- Part IV: Sharing Models
- Chapter 19: Sharing Models: Challenges and Methods
- Chapter 20: Ensembles of Learning Machines
- 20.1 When and Why Ensembles Work
- 20.1.1 Intuition
- 20.1.2 Theoretical Foundation
- 20.2 Bootstrap Aggregating (Bagging)
- 20.2.1 How Bagging Works
- 20.2.2 When and Why Bagging Works
- 20.2.3 Potential Advantages of Bagging for SEE
- 20.3 Regression Trees (RTs) for Bagging
- 20.4 Evaluation Framework
- 20.4.1 Choice of Data Sets and Preprocessing Techniques
- 20.4.1.1 PROMISE data
- 20.4.1.2 ISBSG data
- 20.4.2 Choice of Learning Machines
- 20.4.3 Choice of Evaluation Methods
- 20.4.4 Choice of Parameters
- 20.5 Evaluation of Bagging+RTs in SEE
- 20.5.1 Friedman Ranking
- 20.5.2 Approaches Most Often Ranked First or Second in Terms of MAE, MMRE and PRED(25)
- 20.5.3 Magnitude of Performance Against the Best
- 20.5.4 Discussion
- 20.6 Further Understanding of Bagging+RTs in SEE
- 20.7 Summary
- Chapter 21: How to Adapt Models in a Dynamic World
- 21.1 Cross-Company Data and Questions Tackled
- 21.2 Related Work