The data science handbook
Format: Ebook
Language: English
Published: Hoboken, New Jersey: John Wiley & Sons, Incorporated, 2017
Edition: 1st ed.
Series: THEi Wiley ebooks
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009849125506719
Table of Contents:
- Cover
- Title Page
- Copyright
- Dedication
- Contents
- Preface
- Chapter 1 Introduction: Becoming a Unicorn
- 1.1 Aren't Data Scientists Just Overpaid Statisticians?
- 1.2 How Is This Book Organized?
- 1.3 How to Use This Book?
- 1.4 Why Is It All in Python™, Anyway?
- 1.5 Example Code and Datasets
- 1.6 Parting Words
- Part I The Stuff You'll Always Use
- Chapter 2 The Data Science Road Map
- 2.1 Frame the Problem
- 2.2 Understand the Data: Basic Questions
- 2.3 Understand the Data: Data Wrangling
- 2.4 Understand the Data: Exploratory Analysis
- 2.5 Extract Features
- 2.6 Model
- 2.7 Present Results
- 2.8 Deploy Code
- 2.9 Iterating
- 2.10 Glossary
- Chapter 3 Programming Languages
- 3.1 Why Use a Programming Language? What Are the Other Options?
- 3.2 A Survey of Programming Languages for Data Science
- 3.3 Python Crash Course
- 3.4 Strings
- 3.5 Defining Functions
- 3.6 Python's Technical Libraries
- 3.7 Other Python Resources
- 3.8 Further Reading
- 3.9 Glossary
- Interlude: My Personal Toolkit
- Chapter 4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
- 4.1 The Worst Dataset in the World
- 4.2 How to Identify Pathologies
- 4.3 Problems with Data Content
- 4.4 Formatting Issues
- 4.5 Example Formatting Script
- 4.6 Regular Expressions
- 4.7 Life in the Trenches
- 4.8 Glossary
- Chapter 5 Visualizations and Simple Metrics
- 5.1 A Note on Python's Visualization Tools
- 5.2 Example Code
- 5.3 Pie Charts
- 5.4 Bar Charts
- 5.5 Histograms
- 5.6 Means, Standard Deviations, Medians, and Quantiles
- 5.7 Boxplots
- 5.8 Scatterplots
- 5.9 Scatterplots with Logarithmic Axes
- 5.10 Scatter Matrices
- 5.11 Heatmaps
- 5.12 Correlations
- 5.13 Anscombe's Quartet and the Limits of Numbers
- 5.14 Time Series
- 5.15 Further Reading
- 5.16 Glossary
- Chapter 6 Machine Learning Overview
- 6.1 Historical Context
- 6.2 Supervised versus Unsupervised
- 6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting
- 6.4 Further Reading
- 6.5 Glossary
- Chapter 7 Interlude: Feature Extraction Ideas
- 7.1 Standard Features
- 7.2 Features That Involve Grouping
- 7.3 Preview of More Sophisticated Features
- 7.4 Defining the Feature You Want to Predict
- Chapter 8 Machine Learning Classification
- 8.1 What Is a Classifier, and What Can You Do with It?
- 8.2 A Few Practical Concerns
- 8.3 Binary versus Multiclass
- 8.4 Example Script
- 8.5 Specific Classifiers
- 8.6 Evaluating Classifiers
- 8.7 Selecting Classification Cutoffs
- 8.8 Further Reading
- 8.9 Glossary
- Chapter 9 Technical Communication and Documentation
- 9.1 Several Guiding Principles
- 9.2 Slide Decks
- 9.3 Written Reports
- 9.4 Speaking: What Has Worked for Me
- 9.5 Code Documentation
- 9.6 Further Reading
- 9.7 Glossary
- Part II Stuff You Still Need to Know
- Chapter 10 Unsupervised Learning: Clustering and Dimensionality Reduction
- 10.1 The Curse of Dimensionality
- 10.2 Example: Eigenfaces for Dimensionality Reduction
- 10.3 Principal Component Analysis and Factor Analysis
- 10.4 Scree Plots and Understanding Dimensionality
- 10.5 Factor Analysis
- 10.6 Limitations of PCA
- 10.7 Clustering
- 10.8 Further Reading
- 10.9 Glossary
- Chapter 11 Regression
- 11.1 Example: Predicting Diabetes Progression
- 11.2 Least Squares
- 11.3 Fitting Nonlinear Curves
- 11.4 Goodness of Fit: R² and Correlation
- 11.5 Correlation of Residuals
- 11.6 Linear Regression
- 11.7 LASSO Regression and Feature Selection
- 11.8 Further Reading
- 11.9 Glossary
- Chapter 12 Data Encodings and File Formats
- 12.1 Typical File Format Categories
- 12.2 CSV Files
- 12.3 JSON Files
- 12.4 XML Files
- 12.5 HTML Files
- 12.6 Tar Files
- 12.7 GZip Files
- 12.8 Zip Files
- 12.9 Image Files: Rasterized, Vectorized, and/or Compressed
- 12.10 It's All Bytes at the End of the Day
- 12.11 Integers
- 12.12 Floats
- 12.13 Text Data
- 12.14 Further Reading
- 12.15 Glossary
- Chapter 13 Big Data
- 13.1 What Is Big Data?
- 13.2 Hadoop: The File System and the Processor
- 13.3 Using HDFS
- 13.4 Example PySpark Script
- 13.5 Spark Overview
- 13.6 Spark Operations
- 13.7 Two Ways to Run PySpark
- 13.8 Configuring Spark
- 13.9 Under the Hood
- 13.10 Spark Tips and Gotchas
- 13.11 The MapReduce Paradigm
- 13.12 Performance Considerations
- 13.13 Further Reading
- 13.14 Glossary
- Chapter 14 Databases
- 14.1 Relational Databases and MySQL®
- 14.2 Key-Value Stores
- 14.3 Wide Column Stores
- 14.4 Document Stores
- 14.5 Further Reading
- 14.6 Glossary
- Chapter 15 Software Engineering Best Practices
- 15.1 Coding Style
- 15.2 Version Control and Git for Data Scientists
- 15.3 Testing Code
- 15.4 Test-Driven Development
- 15.5 AGILE Methodology
- 15.6 Further Reading
- 15.7 Glossary
- Chapter 16 Natural Language Processing
- 16.1 Do I Even Need NLP?
- 16.2 The Great Divide: Language versus Statistics
- 16.3 Example: Sentiment Analysis on Stock Market Articles
- 16.4 Software and Datasets
- 16.5 Tokenization
- 16.6 Central Concept: Bag-of-Words
- 16.7 Word Weighting: TF-IDF
- 16.8 n-Grams
- 16.9 Stop Words
- 16.10 Lemmatization and Stemming
- 16.11 Synonyms
- 16.12 Part of Speech Tagging
- 16.13 Common Problems
- 16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
- 16.15 Further Reading
- 16.16 Glossary
- Chapter 17 Time Series Analysis
- 17.1 Example: Predicting Wikipedia Page Views
- 17.2 A Typical Workflow
- 17.3 Time Series versus Time-Stamped Events
- 17.4 Resampling and Interpolation
- 17.5 Smoothing Signals
- 17.6 Logarithms and Other Transformations
- 17.7 Trends and Periodicity
- 17.8 Windowing
- 17.9 Brainstorming Simple Features
- 17.10 Better Features: Time Series as Vectors
- 17.11 Fourier Analysis: Sometimes a Magic Bullet
- 17.12 Time Series in Context: The Whole Suite of Features
- 17.13 Further Reading
- 17.14 Glossary
- Chapter 18 Probability
- 18.1 Flipping Coins: Bernoulli Random Variables
- 18.2 Throwing Darts: Uniform Random Variables
- 18.3 The Uniform Distribution and Pseudorandom Numbers
- 18.4 Nondiscrete, Noncontinuous Random Variables
- 18.5 Notation, Expectations, and Standard Deviation
- 18.6 Dependence, Marginal and Conditional Probability
- 18.7 Understanding the Tails
- 18.8 Binomial Distribution
- 18.9 Poisson Distribution
- 18.10 Normal Distribution
- 18.11 Multivariate Gaussian
- 18.12 Exponential Distribution
- 18.13 Log-Normal Distribution
- 18.14 Entropy
- 18.15 Further Reading
- 18.16 Glossary
- Chapter 19 Statistics
- 19.1 Statistics in Perspective
- 19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
- 19.3 Hypothesis Testing: Key Idea and Example
- 19.4 Multiple Hypothesis Testing
- 19.5 Parameter Estimation
- 19.6 Hypothesis Testing: t-Test
- 19.7 Confidence Intervals
- 19.8 Bayesian Statistics
- 19.9 Naive Bayesian Statistics
- 19.10 Bayesian Networks
- 19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
- 19.12 Further Reading
- 19.13 Glossary
- Chapter 20 Programming Language Concepts
- 20.1 Programming Paradigms
- 20.2 Compilation and Interpretation
- 20.3 Type Systems
- 20.4 Further Reading
- 20.5 Glossary
- Chapter 21 Performance and Computer Memory
- 21.1 Example Script
- 21.2 Algorithm Performance and Big-O Notation
- 21.3 Some Classic Problems: Sorting a List and Binary Search
- 21.4 Amortized Performance and Average Performance
- 21.5 Two Principles: Reducing Overhead and Managing Memory
- 21.6 Performance Tip: Use Numerical Libraries When Applicable
- 21.7 Performance Tip: Delete Large Structures You Don't Need
- 21.8 Performance Tip: Use Built-In Functions When Possible
- 21.9 Performance Tip: Avoid Superfluous Function Calls
- 21.10 Performance Tip: Avoid Creating Large New Objects
- 21.11 Further Reading
- 21.12 Glossary
- Part III Specialized or Advanced Topics
- Chapter 22 Computer Memory and Data Structures
- 22.1 Virtual Memory, the Stack, and the Heap
- 22.2 Example C Program
- 22.3 Data Types and Arrays in Memory
- 22.4 Structs
- 22.5 Pointers, the Stack, and the Heap
- 22.6 Key Data Structures
- 22.7 Further Reading
- 22.8 Glossary
- Chapter 23 Maximum Likelihood Estimation and Optimization
- 23.1 Maximum Likelihood Estimation
- 23.2 A Simple Example: Fitting a Line
- 23.3 Another Example: Logistic Regression
- 23.4 Optimization
- 23.5 Gradient Descent and Convex Optimization
- 23.6 Convex Optimization
- 23.7 Stochastic Gradient Descent
- 23.8 Further Reading
- 23.9 Glossary
- Chapter 24 Advanced Classifiers
- 24.1 A Note on Libraries
- 24.2 Basic Deep Learning
- 24.3 Convolutional Neural Networks
- 24.4 Different Types of Layers. What the Heck Is a Tensor?
- 24.5 Example: The MNIST Handwriting Dataset
- 24.6 Recurrent Neural Networks
- 24.7 Bayesian Networks
- 24.8 Training and Prediction
- 24.9 Markov Chain Monte Carlo
- 24.10 PyMC Example
- 24.11 Further Reading
- 24.12 Glossary
- Chapter 25 Stochastic Modeling
- 25.1 Markov Chains
- 25.2 Two Kinds of Markov Chain, Two Kinds of Questions
- 25.3 Markov Chain Monte Carlo
- 25.4 Hidden Markov Models and the Viterbi Algorithm
- 25.5 The Viterbi Algorithm
- 25.6 Random Walks
- 25.7 Brownian Motion
- 25.8 ARIMA Models