Machine Learning for Imbalanced Data: Tackle Imbalanced Datasets Using Machine Learning and Deep Learning Techniques

As machine learning practitioners, we often encounter imbalanced datasets, in which one class has considerably fewer instances than the other. Many machine learning algorithms assume a rough balance between the majority and minority classes, which leads to suboptimal performance on imbalanced data. This compr...
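To make the premise concrete, here is a minimal sketch using the imbalanced-learn library that the book introduces in Chapter 1. It is an illustration under stated assumptions, not an excerpt from the book: the 99:1 synthetic class ratio, the logistic regression baseline, and all parameter choices below are hypothetical.

```python
# Minimal sketch (assumes scikit-learn and imbalanced-learn are installed).
# The 99:1 class ratio and the choice of model are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic binary dataset in which the minority class is ~1% of samples.
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Baseline: trained on the raw imbalanced data, the model can score high
# accuracy while recalling few minority-class instances.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test), digits=3))

# Oversample the minority class with SMOTE (training split only), retrain,
# and compare minority-class precision/recall against the baseline.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
resampled = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, resampled.predict(X_test), digits=3))
```

Note that the resampling is applied to the training split only, so the held-out test set keeps the original class distribution; evaluating on resampled data would overstate performance.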


Bibliographic Details
Other Authors: Abhishek, Kumar (author); Abdelaziz, Mounir (author)
Format: Electronic book
Language: English
Published: Birmingham, England : Packt Publishing Ltd, [2023]
Edition: First edition
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009827938806719
Table of Contents:
  • Cover
  • Copyright
  • Contributors
  • Table of Contents
  • Preface
  • Chapter 1: Introduction to Data Imbalance in Machine Learning
  • Technical requirements
  • Introduction to imbalanced datasets
  • Machine learning 101
  • What happens during model training?
  • Types of datasets and splits
  • Cross-validation
  • Common evaluation metrics
  • Confusion matrix
  • ROC
  • Precision-Recall curve
  • Relation between the ROC curve and PR curve
  • Challenges and considerations when dealing with imbalanced data
  • When can we have an imbalance in datasets?
  • Why can imbalanced data be a challenge?
  • When not to worry about data imbalance
  • Introduction to the imbalanced-learn library
  • General rules to follow
  • Summary
  • Questions
  • References
  • Chapter 2: Oversampling Methods
  • Technical requirements
  • What is oversampling?
  • Random oversampling
  • Problems with random oversampling
  • SMOTE
  • How SMOTE works
  • Problems with SMOTE
  • SMOTE variants
  • Borderline-SMOTE
  • ADASYN
  • How ADASYN works
  • Categorical features and SMOTE variants (SMOTE-NC and SMOTEN)
  • Model performance comparison of various oversampling methods
  • Guidance for using various oversampling techniques
  • When to avoid oversampling
  • Oversampling in multi-class classification
  • Summary
  • Exercises
  • References
  • Chapter 3: Undersampling Methods
  • Technical requirements
  • Introducing undersampling
  • When to avoid undersampling the majority class
  • Fixed versus cleaning undersampling
  • Undersampling approaches
  • Removing examples uniformly
  • Random undersampling
  • ClusterCentroids
  • Strategies for removing noisy observations
  • ENN, RENN, and AllKNN
  • Tomek links
  • Neighborhood Cleaning Rule
  • Instance hardness threshold
  • Strategies for removing easy observations
  • Condensed Nearest Neighbors
  • One-sided selection
  • Combining undersampling and oversampling
  • Model performance comparison
  • Summary
  • Exercises
  • References
  • Chapter 4: Ensemble Methods
  • Technical requirements
  • Bagging techniques for imbalanced data
  • UnderBagging
  • OverBagging
  • SMOTEBagging
  • Comparative performance of bagging methods
  • Boosting techniques for imbalanced data
  • AdaBoost
  • RUSBoost, SMOTEBoost, and RAMOBoost
  • Ensemble of ensembles
  • EasyEnsemble
  • Comparative performance of boosting methods
  • Model performance comparison
  • Summary
  • Questions
  • References
  • Chapter 5: Cost-Sensitive Learning
  • Technical requirements
  • The concept of Cost-Sensitive Learning
  • Costs and cost functions
  • Types of cost-sensitive learning
  • Difference between CSL and resampling
  • Problems with rebalancing techniques
  • Understanding costs in practice
  • Cost-Sensitive Learning for logistic regression
  • Cost-Sensitive Learning for decision trees
  • Cost-Sensitive Learning using scikit-learn and XGBoost models
  • MetaCost - making any classification model cost-sensitive
  • Threshold adjustment
  • Methods for threshold tuning
  • Summary
  • Questions
  • References
  • Chapter 6: Data Imbalance in Deep Learning
  • Technical requirements
  • A brief introduction to deep learning
  • Neural networks
  • Perceptron
  • Activation functions
  • Layers
  • Feedforward neural networks
  • Training neural networks
  • The effect of the learning rate on data imbalance
  • Image processing using Convolutional Neural Networks
  • Text analysis using Natural Language Processing
  • Data imbalance in deep learning
  • The impact of data imbalance on deep learning models
  • Overview of deep learning techniques to handle data imbalance
  • Multi-label classification
  • Summary
  • Questions
  • References
  • Chapter 7: Data-Level Deep Learning Methods
  • Technical requirements
  • Preparing the data
  • Creating the training loop
  • Sampling techniques for deep learning models
  • Random oversampling
  • Dynamic sampling
  • Data augmentation techniques for vision
  • Data-level techniques for text classification
  • Dataset and baseline model
  • Document-level augmentation
  • Character and word-level augmentation
  • Discussion of other data-level deep learning methods and their key ideas
  • Two-phase learning
  • Expansive Over-Sampling
  • Using generative models for oversampling
  • DeepSMOTE
  • Neural style transfer
  • Summary
  • Questions
  • References
  • Chapter 8: Algorithm-Level Deep Learning Techniques
  • Technical requirements
  • Motivation for algorithm-level techniques
  • Weighting techniques
  • Using PyTorch's weight parameter
  • Handling textual data
  • Deferred re-weighting - a minor variant of the class weighting technique
  • Explicit loss function modification
  • Focal loss
  • Class-balanced loss
  • Class-dependent temperature loss
  • Class-wise difficulty-balanced loss
  • Discussing other algorithm-based techniques
  • Regularization techniques
  • Siamese networks
  • Deeper neural networks
  • Threshold adjustment
  • Summary
  • Questions
  • References
  • Chapter 9: Hybrid Deep Learning Methods
  • Technical requirements
  • Using graph machine learning for imbalanced data
  • Understanding graphs
  • Graph machine learning
  • Dealing with imbalanced data
  • Case study - the performance of XGBoost, MLP, and a GCN on an imbalanced dataset
  • Hard example mining
  • Online Hard Example Mining
  • Minority class incremental rectification
  • Utilizing the hard sample mining technique in minority class incremental rectification
  • Summary
  • Questions
  • References
  • Chapter 10: Model Calibration
  • Technical requirements
  • Introduction to model calibration
  • Why bother with model calibration
  • Models with and without well-calibrated probabilities
  • Calibration curves (reliability plots)
  • Brier score
  • Expected Calibration Error
  • The influence of data balancing techniques on model calibration
  • Plotting calibration curves for a model trained on a real-world dataset
  • Model calibration techniques
  • The calibration of model scores to account for sampling
  • Platt's scaling
  • Isotonic regression
  • Choosing between Platt's scaling and isotonic regression
  • Temperature scaling
  • Label smoothing
  • The impact of calibration on a model's performance
  • Summary
  • Questions
  • References
  • Appendix: Machine Learning Pipeline in Production
  • Machine learning training pipeline
  • Inferencing (online or batch)
  • Assessments
  • Chapter 1 - Introduction to Data Imbalance in Machine Learning
  • Chapter 2 - Oversampling Methods
  • Chapter 3 - Undersampling Methods
  • Chapter 4 - Ensemble Methods
  • Chapter 5 - Cost-Sensitive Learning
  • Chapter 6 - Data Imbalance in Deep Learning
  • Chapter 7 - Data-Level Deep Learning Methods
  • Chapter 8 - Algorithm-Level Deep Learning Techniques
  • Chapter 9 - Hybrid Deep Learning Methods
  • Chapter 10 - Model Calibration
  • Index
  • Other Books You May Enjoy