Cleaning data for effective data science: doing the other 80% of the work with Python, R, and command-line tools

Data in its raw state is rarely ready for productive analysis. This book not only teaches you data preparation, but also what questions you should ask of your data. It focuses on the thought processes necessary for successful data cleaning as much as on concise and precise code examples that express...

Bibliographic Details
Other Authors:	Mertz, David (author)
Format:	E-book
Language:	English
Published:	Birmingham, England ; Mumbai : Packt Publishing [2021]
Subjects:	Database management. Data integrity.
View in Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631706506719

Table of Contents:

Cover
Copyright
Contributors
Table of Contents
Preface
Part I - Data Ingestion
Chapter 1: Tabular Formats
Tidying Up
CSV
Sanity Checks
The Good, the Bad, and the Textual Data
The Bad
The Good
Spreadsheets Considered Harmful
SQL RDBMS
Massaging Data Types
Repeating in R
Where SQL Goes Wrong (and How to Notice It)
Other Formats
HDF5 and NetCDF-4
Tools and Libraries
SQLite
Apache Parquet
Data Frames
Spark/Scala
Pandas and Derived Wrappers
Vaex
Data Frames in R (Tidyverse)
Data Frames in R (data.table)
Bash for Fun
Exercises
Tidy Data from Excel
Tidy Data from SQL
Denouement
Chapter 2: Hierarchical Formats
JSON
What JSON Looks Like
NaN Handling and Data Types
JSON Lines
GeoJSON
Tidy Geography
JSON Schema
XML
User Records
Keyhole Markup Language
Configuration Files
INI and Flat Custom Formats
TOML
Yet Another Markup Language
NoSQL Databases
Document-Oriented Databases
Missing Fields
Denormalization and Its Discontents
Key/Value Stores
Exercises
Exploring Filled Area
Create a Relational Model
Denouement
Chapter 3: Repurposing Data Sources
Web Scraping
HTML Tables
Non-Tabular Data
Command-Line Scraping
Portable Document Format
Image Formats
Pixel Statistics
Channel Manipulation
Metadata
Binary Serialized Data Structures
Custom Text Formats
A Structured Log
Character Encodings
Exercises
Enhancing the NPY Parser
Scraping Web Traffic
Denouement
Part II - The Vicissitudes of Error
Chapter 4: Anomaly Detection
Missing Data
SQL
Hierarchical Formats
Sentinels
Miscoded Data
Fixed Bounds
Outliers
Z-Score
Interquartile Range
Multivariate Outliers
Exercises
A Famous Experiment
Misspelled Words
Denouement
Chapter 5: Data Quality
Missing Data
Biasing Trends
Understanding Bias
Detecting Bias
Comparison to Baselines
Benford's Law
Class Imbalance
Normalization and Scaling
Applying a Machine Learning Model
Scaling Techniques
Factor and Sample Weighting
Cyclicity and Autocorrelation
Domain Knowledge Trends
Discovered Cycles
Bespoke Validation
Collation Validation
Transcription Validation
Exercises
Data Characterization
Oversampled Polls
Denouement
Part III - Rectification and Creation
Chapter 6: Value Imputation
Typical-Value Imputation
Typical Tabular Data
Locality Imputation
Trend Imputation
Types of Trends
A Larger Coarse Time Series
Understanding the Data
Removing Unusable Data
Imputing Consistency
Interpolation
Non-Temporal Trends
Sampling
Undersampling
Oversampling
Exercises
Alternate Trend Imputation
Balancing Multiple Features
Denouement
Chapter 7: Feature Engineering
Date/Time Fields
Creating Datetimes
Imposing Regularity
Duplicated Timestamps
Adding Timestamps
String Fields
Fuzzy Matching
Explicit Categories
String Vectors
Decompositions
Rotation and Whitening
Dimensionality Reduction
Visualization
Quantization and Binarization
One-Hot Encoding
Polynomial Features
Generating Synthetic Features
Feature Selection
Exercises
Intermittent Occurrences
Characterizing Levels
Denouement
Part IV - Ancillary Matters
Closure
What You Know
What You Don't Know (Yet)
Glossary
Other Books You May Enjoy
Index
