Cleaning data for effective data science: doing the other 80% of the work with Python, R, and command-line tools
Data in its raw state is rarely ready for productive analysis. This book not only teaches you data preparation, but also what questions you should ask of your data. It focuses on the thought processes necessary for successful data cleaning as much as on concise and precise code examples that express...
Other Authors: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Birmingham, England ; Mumbai : Packt Publishing [2021] |
Subjects: | |
View at the Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631706506719 |
Table of Contents:
- Cover
- Copyright
- Contributors
- Table of Contents
- Preface
- Part I - Data Ingestion
- Chapter 1: Tabular Formats
- Tidying Up
- CSV
- Sanity Checks
- The Good, the Bad, and the Textual Data
- The Bad
- The Good
- Spreadsheets Considered Harmful
- SQL RDBMS
- Massaging Data Types
- Repeating in R
- Where SQL Goes Wrong (and How to Notice It)
- Other Formats
- HDF5 and NetCDF-4
- Tools and Libraries
- SQLite
- Apache Parquet
- Data Frames
- Spark/Scala
- Pandas and Derived Wrappers
- Vaex
- Data Frames in R (Tidyverse)
- Data Frames in R (data.table)
- Bash for Fun
- Exercises
- Tidy Data from Excel
- Tidy Data from SQL
- Denouement
- Chapter 2: Hierarchical Formats
- JSON
- What JSON Looks Like
- NaN Handling and Data Types
- JSON Lines
- GeoJSON
- Tidy Geography
- JSON Schema
- XML
- User Records
- Keyhole Markup Language
- Configuration Files
- INI and Flat Custom Formats
- TOML
- Yet Another Markup Language
- NoSQL Databases
- Document-Oriented Databases
- Missing Fields
- Denormalization and Its Discontents
- Key/Value Stores
- Exercises
- Exploring Filled Area
- Create a Relational Model
- Denouement
- Chapter 3: Repurposing Data Sources
- Web Scraping
- HTML Tables
- Non-Tabular Data
- Command-Line Scraping
- Portable Document Format
- Image Formats
- Pixel Statistics
- Channel Manipulation
- Metadata
- Binary Serialized Data Structures
- Custom Text Formats
- A Structured Log
- Character Encodings
- Exercises
- Enhancing the NPY Parser
- Scraping Web Traffic
- Denouement
- Part II - The Vicissitudes of Error
- Chapter 4: Anomaly Detection
- Missing Data
- SQL
- Hierarchical Formats
- Sentinels
- Miscoded Data
- Fixed Bounds
- Outliers
- Z-Score
- Interquartile Range
- Multivariate Outliers
- Exercises
- A Famous Experiment
- Misspelled Words
- Denouement
- Chapter 5: Data Quality
- Missing Data
- Biasing Trends
- Understanding Bias
- Detecting Bias
- Comparison to Baselines
- Benford's Law
- Class Imbalance
- Normalization and Scaling
- Applying a Machine Learning Model
- Scaling Techniques
- Factor and Sample Weighting
- Cyclicity and Autocorrelation
- Domain Knowledge Trends
- Discovered Cycles
- Bespoke Validation
- Collation Validation
- Transcription Validation
- Exercises
- Data Characterization
- Oversampled Polls
- Denouement
- Part III - Rectification and Creation
- Chapter 6: Value Imputation
- Typical-Value Imputation
- Typical Tabular Data
- Locality Imputation
- Trend Imputation
- Types of Trends
- A Larger Coarse Time Series
- Understanding the Data
- Removing Unusable Data
- Imputing Consistency
- Interpolation
- Non-Temporal Trends
- Sampling
- Undersampling
- Oversampling
- Exercises
- Alternate Trend Imputation
- Balancing Multiple Features
- Denouement
- Chapter 7: Feature Engineering
- Date/Time Fields
- Creating Datetimes
- Imposing Regularity
- Duplicated Timestamps
- Adding Timestamps
- String Fields
- Fuzzy Matching
- Explicit Categories
- String Vectors
- Decompositions
- Rotation and Whitening
- Dimensionality Reduction
- Visualization
- Quantization and Binarization
- One-Hot Encoding
- Polynomial Features
- Generating Synthetic Features
- Feature Selection
- Exercises
- Intermittent Occurrences
- Characterizing Levels
- Denouement
- Part IV - Ancillary Matters
- Closure
- What You Know
- What You Don't Know (Yet)
- Glossary
- Other Books You May Enjoy
- Index