Sharing data and models in software engineering

Data Science for Software Engineering: Sharing Data and Models presents guidance and procedures for reusing data and models between projects to produce results that are useful and relevant. Starting with a background section of practical lessons and warnings for beginner data scientists for software...


Bibliographic Details
Other Authors: Menzies, Tim (author); Rogers, Mark (designer)
Format: Electronic book
Language: English
Published: Waltham, Massachusetts: Morgan Kaufmann, 2015.
Edition: First edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629837106719
Table of Contents:
  • Front Cover
  • Sharing Data and Models in Software Engineering
  • Copyright
  • Why this book?
  • Foreword
  • Contents
  • List of Figures
  • Chapter 1: Introduction
  • 1.1 Why Read This Book?
  • 1.2 What Do We Mean by "Sharing"?
  • 1.2.1 Sharing Insights
  • 1.2.2 Sharing Models
  • 1.2.3 Sharing Data
  • 1.2.4 Sharing Analysis Methods
  • 1.2.5 Types of Sharing
  • 1.2.6 Challenges with Sharing
  • 1.2.7 How to Share
  • 1.3 What? (Our Executive Summary)
  • 1.3.1 An Overview
  • 1.3.2 More Details
  • 1.4 How to Read This Book
  • 1.4.1 Data Analysis Patterns
  • 1.5 But What About …? (What Is Not in This Book)
  • 1.5.1 What About "Big Data"?
  • 1.5.2 What About Related Work?
  • 1.5.3 Why All the Defect Prediction and Effort Estimation?
  • 1.6 Who? (About the Authors)
  • 1.7 Who Else? (Acknowledgments)
  • Part I: Data Mining for Managers
  • Chapter 2: Rules for Managers
  • 2.1 The Inductive Engineering Manifesto
  • 2.2 More Rules
  • Chapter 3: Rule #1: Talk to the Users
  • 3.1 User Biases
  • 3.2 Data Mining Biases
  • 3.3 Can We Avoid Bias?
  • 3.4 Managing Biases
  • 3.5 Summary
  • Chapter 4: Rule #2: Know the Domain
  • 4.1 Cautionary Tale #1: "Discovering" Random Noise
  • 4.2 Cautionary Tale #2: Jumping at Shadows
  • 4.3 Cautionary Tale #3: It Pays to Ask
  • 4.4 Summary
  • Chapter 5: Rule #3: Suspect Your Data
  • 5.1 Controlling Data Collection
  • 5.2 Problems with Controlled Data Collection
  • 5.3 Rinse (and Prune) Before Use
  • 5.3.1 Row Pruning
  • 5.3.2 Column Pruning
  • 5.4 On the Value of Pruning
  • 5.5 Summary
  • Chapter 6: Rule #4: Data Science Is Cyclic
  • 6.1 The Knowledge Discovery Cycle
  • 6.2 Evolving Cyclic Development
  • 6.2.1 Scouting
  • 6.2.2 Surveying
  • 6.2.3 Building
  • 6.2.4 Effort
  • 6.3 Summary
  • Part II: Data Mining: A Technical Tutorial
  • Chapter 7: Data Mining and SE
  • 7.1 Some Definitions
  • 7.2 Some Application Areas
  • Chapter 8: Defect Prediction
  • 8.1 Defect Detection Economics
  • 8.2 Static Code Defect Prediction
  • 8.2.1 Easy to Use
  • 8.2.2 Widely Used
  • 8.2.3 Useful
  • Chapter 9: Effort Estimation
  • 9.1 The Estimation Problem
  • 9.2 How to Make Estimates
  • 9.2.1 Expert-Based Estimation
  • 9.2.2 Model-Based Estimation
  • 9.2.3 Hybrid Methods
  • Chapter 10: Data Mining (Under the Hood)
  • 10.1 Data Carving
  • 10.2 About the Data
  • 10.3 Cohen Pruning
  • 10.4 Discretization
  • 10.4.1 Other Discretization Methods
  • 10.5 Column Pruning
  • 10.6 Row Pruning
  • 10.7 Cluster Pruning
  • 10.7.1 Advantages of Prototypes
  • 10.7.2 Advantages of Clustering
  • 10.8 Contrast Pruning
  • 10.9 Goal Pruning
  • 10.10 Extensions for Continuous Classes
  • 10.10.1 How RTs Work
  • 10.10.2 Creating Splits for Categorical Input Features
  • 10.10.3 Splits on Numeric Input Features
  • 10.10.4 Termination Condition and Predictions
  • 10.10.5 Potential Advantages of RTs for Software Effort Estimation
  • 10.10.6 Predictions for Multiple Numeric Goals
  • Part III: Sharing Data
  • Chapter 11: Sharing Data: Challenges and Methods
  • 11.1 Houston, We Have a Problem
  • 11.2 Good News, Everyone
  • Chapter 12: Learning Contexts
  • 12.1 Background
  • 12.2 Manual Methods for Contextualization
  • 12.3 Automatic Methods
  • 12.4 Other Motivation to Find Contexts
  • 12.4.1 Variance Reduction
  • 12.4.2 Anomaly Detection
  • 12.4.3 Certification Envelopes
  • 12.4.4 Incremental Learning
  • 12.4.5 Compression
  • 12.4.6 Optimization
  • 12.5 How to Find Local Regions
  • 12.5.1 License
  • 12.5.2 Installing CHUNK
  • 12.5.3 Testing Your Installation
  • 12.5.4 Applying CHUNK to Other Models
  • 12.6 Inside CHUNK
  • 12.6.1 Roadmap to Functions
  • 12.6.2 Distance Calculations
  • 12.6.2.1 Normalize
  • 12.6.2.2 SquaredDifference
  • 12.6.3 Dividing the Data
  • 12.6.3.1 FastDiv
  • 12.6.3.2 TwoDistantPoints
  • 12.6.3.3 Settings
  • 12.6.3.4 Chunk (main function)
  • 12.6.4 Support Utilities
  • 12.6.4.1 Some standard tricks
  • 12.6.4.2 Tree iterators
  • 12.6.4.3 Pretty printing
  • 12.7 Putting It all Together
  • 12.7.1 _nasa93
  • 12.8 Using CHUNK
  • 12.9 Closing Remarks
  • Chapter 13: Cross-Company Learning: Handling the Data Drought
  • 13.1 Motivation
  • 13.2 Setting the Ground for Analyses
  • 13.2.1 Wait … Is This Really CC Data?
  • 13.2.2 Mining the Data
  • 13.2.3 Magic Trick: NN Relevancy Filtering
  • 13.3 Analysis #1: Can CC Data be Useful for an Organization?
  • 13.3.1 Design
  • 13.3.2 Results from Analysis #1
  • 13.3.3 Checking the Analysis #1 Results
  • 13.3.4 Discussion of Analysis #1
  • 13.4 Analysis #2: How to Clean Up CC Data for Local Tuning?
  • 13.4.1 Design
  • 13.4.2 Results
  • 13.4.3 Discussions
  • 13.5 Analysis #3: How Much Local Data Does an Organization Need for a Local Model?
  • 13.5.1 Design
  • 13.5.2 Results from Analysis #3
  • 13.5.3 Checking the Analysis #3 Results
  • 13.5.4 Discussion of Analysis #3
  • 13.6 How Trustworthy Are These Results?
  • 13.7 Are These Useful in Practice or Just Number Crunching?
  • 13.8 What's New on Cross-Learning?
  • 13.8.1 Discussion
  • 13.9 What's the Takeaway?
  • Chapter 14: Building Smarter Transfer Learners
  • 14.1 What Is Actually the Problem?
  • 14.2 What Do We Know So Far?
  • 14.2.1 Transfer Learning
  • 14.2.2 Transfer Learning and SE
  • 14.2.3 Data Set Shift
  • 14.3 An Example Technology: TEAK
  • 14.4 The Details of the Experiments
  • 14.4.1 Performance Comparison
  • 14.4.2 Performance Measures
  • 14.4.3 Retrieval Tendency
  • 14.5 Results
  • 14.5.1 Performance Comparison
  • 14.5.2 Inspecting Selection Tendencies
  • 14.6 Discussion
  • 14.7 What Are the Takeaways?
  • Chapter 15: Sharing Less Data (Is a Good Thing)
  • 15.1 Can We Share Less Data?
  • 15.2 Using Less Data
  • 15.3 Why Share Less Data?
  • 15.3.1 Less Data Is More Reliable
  • 15.3.2 Less Data Is Faster to Discuss
  • 15.3.3 Less Data Is Easier to Process
  • 15.4 How to Find Less Data
  • 15.4.1 Input
  • 15.4.2 Comparisons to Other Learners
  • 15.4.3 Reporting the Results
  • 15.4.4 Discussion of Results
  • 15.5 What's Next?
  • Chapter 16: How to Keep Your Data Private
  • 16.1 Motivation
  • 16.2 What Is PPDP and Why Is It Important?
  • 16.3 What Is Considered a Breach of Privacy?
  • 16.4 How to Avoid Privacy Breaches?
  • 16.4.1 Generalization and Suppression
  • 16.4.2 Anatomization and Permutation
  • 16.4.3 Perturbation
  • 16.4.4 Output Perturbation
  • 16.5 How Are Privacy-Preserving Algorithms Evaluated?
  • 16.5.1 Privacy Metrics
  • 16.5.2 Modeling the Background Knowledge of an Attacker
  • 16.6 Case Study: Privacy and Cross-Company Defect Prediction
  • 16.6.1 Results and Contributions
  • 16.6.2 Privacy and CCDP
  • 16.6.3 CLIFF
  • 16.6.4 MORPH
  • 16.6.5 Example of CLIFF & MORPH
  • 16.6.6 Evaluation Metrics
  • 16.6.7 Evaluating Utility via Classification
  • 16.6.8 Evaluating Privatization
  • 16.6.8.1 Defining privacy
  • 16.6.9 Experiments
  • 16.6.9.1 Data
  • 16.6.10 Design
  • 16.6.11 Defect Predictors
  • 16.6.12 Query Generator
  • 16.6.13 Benchmark Privacy Algorithms
  • 16.6.14 Experimental Evaluation
  • 16.6.15 Discussion
  • 16.6.16 Related Work: Privacy in SE
  • 16.6.17 Summary
  • Chapter 17: Compensating for Missing Data
  • 17.1 Background Notes on SEE and Instance Selection
  • 17.1.1 Software Effort Estimation
  • 17.1.2 Instance Selection in SEE
  • 17.2 Data Sets and Performance Measures
  • 17.2.1 Data Sets
  • 17.2.2 Error Measures
  • 17.3 Experimental Conditions
  • 17.3.1 The Algorithms Adopted
  • 17.3.2 Proposed Method: POP1
  • 17.3.3 Experiments
  • 17.4 Results
  • 17.4.1 Results Without Instance Selection
  • 17.4.2 Results with Instance Selection
  • 17.5 Summary
  • Chapter 18: Active Learning: Learning More with Less
  • 18.1 How Does the QUICK Algorithm Work?
  • 18.1.1 Getting Rid of Similar Features: Synonym Pruning
  • 18.1.2 Getting Rid of Dissimilar Instances: Outlier Pruning
  • 18.2 Notes on Active Learning
  • 18.3 The Application and Implementation Details of QUICK
  • 18.3.1 Phase 1: Synonym Pruning
  • 18.3.2 Phase 2: Outlier Removal and Estimation
  • 18.3.3 Seeing QUICK in Action with a Toy Example
  • 18.3.3.1 Phase 1: Synonym pruning
  • 18.3.3.2 Phase 2: Outlier removal and estimation
  • 18.4 How the Experiments Are Designed
  • 18.5 Results
  • 18.5.1 Performance
  • 18.5.2 Reduction via Synonym and Outlier Pruning
  • 18.5.3 Comparison of QUICK vs. CART
  • 18.5.4 Detailed Look at the Statistical Analysis
  • 18.5.5 Early Results on Defect Data Sets
  • 18.6 Summary
  • Part IV: Sharing Models
  • Chapter 19: Sharing Models: Challenges and Methods
  • Chapter 20: Ensembles of Learning Machines
  • 20.1 When and Why Ensembles Work
  • 20.1.1 Intuition
  • 20.1.2 Theoretical Foundation
  • 20.2 Bootstrap Aggregating (Bagging)
  • 20.2.1 How Bagging Works
  • 20.2.2 When and Why Bagging Works
  • 20.2.3 Potential Advantages of Bagging for SEE
  • 20.3 Regression Trees (RTs) for Bagging
  • 20.4 Evaluation Framework
  • 20.4.1 Choice of Data Sets and Preprocessing Techniques
  • 20.4.1.1 PROMISE data
  • 20.4.1.2 ISBSG data
  • 20.4.2 Choice of Learning Machines
  • 20.4.3 Choice of Evaluation Methods
  • 20.4.4 Choice of Parameters
  • 20.5 Evaluation of Bagging+RTs in SEE
  • 20.5.1 Friedman Ranking
  • 20.5.2 Approaches Most Often Ranked First or Second in Terms of MAE, MMRE and PRED(25)
  • 20.5.3 Magnitude of Performance Against the Best
  • 20.5.4 Discussion
  • 20.6 Further Understanding of Bagging+RTs in SEE
  • 20.7 Summary
  • Chapter 21: How to Adapt Models in a Dynamic World
  • 21.1 Cross-Company Data and Questions Tackled
  • 21.2 Related Work