The data science handbook

Bibliographic Details
Other Authors: Cady, Field, 1984- (author)
Format: Electronic book
Language: English
Published: Hoboken, New Jersey: John Wiley & Sons, Incorporated, 2017
Edition: 1st ed.
Series: THEi Wiley ebooks.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009849125506719
Table of Contents:
  • Cover
  • Title Page
  • Copyright
  • Dedication
  • Contents
  • Preface
  • Chapter 1 Introduction: Becoming a Unicorn
  • 1.1 Aren't Data Scientists Just Overpaid Statisticians?
  • 1.2 How Is This Book Organized?
  • 1.3 How to Use This Book
  • 1.4 Why Is It All in Python™, Anyway?
  • 1.5 Example Code and Datasets
  • 1.6 Parting Words
  • Part I The Stuff You'll Always Use
  • Chapter 2 The Data Science Road Map
  • 2.1 Frame the Problem
  • 2.2 Understand the Data: Basic Questions
  • 2.3 Understand the Data: Data Wrangling
  • 2.4 Understand the Data: Exploratory Analysis
  • 2.5 Extract Features
  • 2.6 Model
  • 2.7 Present Results
  • 2.8 Deploy Code
  • 2.9 Iterating
  • 2.10 Glossary
  • Chapter 3 Programming Languages
  • 3.1 Why Use a Programming Language? What Are the Other Options?
  • 3.2 A Survey of Programming Languages for Data Science
  • 3.3 Python Crash Course
  • 3.4 Strings
  • 3.5 Defining Functions
  • 3.6 Python's Technical Libraries
  • 3.7 Other Python Resources
  • 3.8 Further Reading
  • 3.9 Glossary
  • Interlude: My Personal Toolkit
  • Chapter 4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
  • 4.1 The Worst Dataset in the World
  • 4.2 How to Identify Pathologies
  • 4.3 Problems with Data Content
  • 4.4 Formatting Issues
  • 4.5 Example Formatting Script
  • 4.6 Regular Expressions
  • 4.7 Life in the Trenches
  • 4.8 Glossary
  • Chapter 5 Visualizations and Simple Metrics
  • 5.1 A Note on Python's Visualization Tools
  • 5.2 Example Code
  • 5.3 Pie Charts
  • 5.4 Bar Charts
  • 5.5 Histograms
  • 5.6 Means, Standard Deviations, Medians, and Quantiles
  • 5.7 Boxplots
  • 5.8 Scatterplots
  • 5.9 Scatterplots with Logarithmic Axes
  • 5.10 Scatter Matrices
  • 5.11 Heatmaps
  • 5.12 Correlations
  • 5.13 Anscombe's Quartet and the Limits of Numbers
  • 5.14 Time Series
  • 5.15 Further Reading
  • 5.16 Glossary
  • Chapter 6 Machine Learning Overview
  • 6.1 Historical Context
  • 6.2 Supervised versus Unsupervised
  • 6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting
  • 6.4 Further Reading
  • 6.5 Glossary
  • Chapter 7 Interlude: Feature Extraction Ideas
  • 7.1 Standard Features
  • 7.2 Features That Involve Grouping
  • 7.3 Preview of More Sophisticated Features
  • 7.4 Defining the Feature You Want to Predict
  • Chapter 8 Machine Learning Classification
  • 8.1 What Is a Classifier, and What Can You Do with It?
  • 8.2 A Few Practical Concerns
  • 8.3 Binary versus Multiclass
  • 8.4 Example Script
  • 8.5 Specific Classifiers
  • 8.6 Evaluating Classifiers
  • 8.7 Selecting Classification Cutoffs
  • 8.8 Further Reading
  • 8.9 Glossary
  • Chapter 9 Technical Communication and Documentation
  • 9.1 Several Guiding Principles
  • 9.2 Slide Decks
  • 9.3 Written Reports
  • 9.4 Speaking: What Has Worked for Me
  • 9.5 Code Documentation
  • 9.6 Further Reading
  • 9.7 Glossary
  • Part II Stuff You Still Need to Know
  • Chapter 10 Unsupervised Learning: Clustering and Dimensionality Reduction
  • 10.1 The Curse of Dimensionality
  • 10.2 Example: Eigenfaces for Dimensionality Reduction
  • 10.3 Principal Component Analysis and Factor Analysis
  • 10.4 Scree Plots and Understanding Dimensionality
  • 10.5 Factor Analysis
  • 10.6 Limitations of PCA
  • 10.7 Clustering
  • 10.8 Further Reading
  • 10.9 Glossary
  • Chapter 11 Regression
  • 11.1 Example: Predicting Diabetes Progression
  • 11.2 Least Squares
  • 11.3 Fitting Nonlinear Curves
  • 11.4 Goodness of Fit: R² and Correlation
  • 11.5 Correlation of Residuals
  • 11.6 Linear Regression
  • 11.7 LASSO Regression and Feature Selection
  • 11.8 Further Reading
  • 11.9 Glossary
  • Chapter 12 Data Encodings and File Formats
  • 12.1 Typical File Format Categories
  • 12.2 CSV Files
  • 12.3 JSON Files
  • 12.4 XML Files
  • 12.5 HTML Files
  • 12.6 Tar Files
  • 12.7 GZip Files
  • 12.8 Zip Files
  • 12.9 Image Files: Rasterized, Vectorized, and/or Compressed
  • 12.10 It's All Bytes at the End of the Day
  • 12.11 Integers
  • 12.12 Floats
  • 12.13 Text Data
  • 12.14 Further Reading
  • 12.15 Glossary
  • Chapter 13 Big Data
  • 13.1 What Is Big Data?
  • 13.2 Hadoop: The File System and the Processor
  • 13.3 Using HDFS
  • 13.4 Example PySpark Script
  • 13.5 Spark Overview
  • 13.6 Spark Operations
  • 13.7 Two Ways to Run PySpark
  • 13.8 Configuring Spark
  • 13.9 Under the Hood
  • 13.10 Spark Tips and Gotchas
  • 13.11 The MapReduce Paradigm
  • 13.12 Performance Considerations
  • 13.13 Further Reading
  • 13.14 Glossary
  • Chapter 14 Databases
  • 14.1 Relational Databases and MySQL®
  • 14.2 Key-Value Stores
  • 14.3 Wide Column Stores
  • 14.4 Document Stores
  • 14.5 Further Reading
  • 14.6 Glossary
  • Chapter 15 Software Engineering Best Practices
  • 15.1 Coding Style
  • 15.2 Version Control and Git for Data Scientists
  • 15.3 Testing Code
  • 15.4 Test-Driven Development
  • 15.5 AGILE Methodology
  • 15.6 Further Reading
  • 15.7 Glossary
  • Chapter 16 Natural Language Processing
  • 16.1 Do I Even Need NLP?
  • 16.2 The Great Divide: Language versus Statistics
  • 16.3 Example: Sentiment Analysis on Stock Market Articles
  • 16.4 Software and Datasets
  • 16.5 Tokenization
  • 16.6 Central Concept: Bag-of-Words
  • 16.7 Word Weighting: TF-IDF
  • 16.8 n-Grams
  • 16.9 Stop Words
  • 16.10 Lemmatization and Stemming
  • 16.11 Synonyms
  • 16.12 Part of Speech Tagging
  • 16.13 Common Problems
  • 16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
  • 16.15 Further Reading
  • 16.16 Glossary
  • Chapter 17 Time Series Analysis
  • 17.1 Example: Predicting Wikipedia Page Views
  • 17.2 A Typical Workflow
  • 17.3 Time Series versus Time-Stamped Events
  • 17.4 Resampling and Interpolation
  • 17.5 Smoothing Signals
  • 17.6 Logarithms and Other Transformations
  • 17.7 Trends and Periodicity
  • 17.8 Windowing
  • 17.9 Brainstorming Simple Features
  • 17.10 Better Features: Time Series as Vectors
  • 17.11 Fourier Analysis: Sometimes a Magic Bullet
  • 17.12 Time Series in Context: The Whole Suite of Features
  • 17.13 Further Reading
  • 17.14 Glossary
  • Chapter 18 Probability
  • 18.1 Flipping Coins: Bernoulli Random Variables
  • 18.2 Throwing Darts: Uniform Random Variables
  • 18.3 The Uniform Distribution and Pseudorandom Numbers
  • 18.4 Nondiscrete, Noncontinuous Random Variables
  • 18.5 Notation, Expectations, and Standard Deviation
  • 18.6 Dependence, Marginal and Conditional Probability
  • 18.7 Understanding the Tails
  • 18.8 Binomial Distribution
  • 18.9 Poisson Distribution
  • 18.10 Normal Distribution
  • 18.11 Multivariate Gaussian
  • 18.12 Exponential Distribution
  • 18.13 Log-Normal Distribution
  • 18.14 Entropy
  • 18.15 Further Reading
  • 18.16 Glossary
  • Chapter 19 Statistics
  • 19.1 Statistics in Perspective
  • 19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
  • 19.3 Hypothesis Testing: Key Idea and Example
  • 19.4 Multiple Hypothesis Testing
  • 19.5 Parameter Estimation
  • 19.6 Hypothesis Testing: t-Test
  • 19.7 Confidence Intervals
  • 19.8 Bayesian Statistics
  • 19.9 Naive Bayesian Statistics
  • 19.10 Bayesian Networks
  • 19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
  • 19.12 Further Reading
  • 19.13 Glossary
  • Chapter 20 Programming Language Concepts
  • 20.1 Programming Paradigms
  • 20.2 Compilation and Interpretation
  • 20.3 Type Systems
  • 20.4 Further Reading
  • 20.5 Glossary
  • Chapter 21 Performance and Computer Memory
  • 21.1 Example Script
  • 21.2 Algorithm Performance and Big-O Notation
  • 21.3 Some Classic Problems: Sorting a List and Binary Search
  • 21.4 Amortized Performance and Average Performance
  • 21.5 Two Principles: Reducing Overhead and Managing Memory
  • 21.6 Performance Tip: Use Numerical Libraries When Applicable
  • 21.7 Performance Tip: Delete Large Structures You Don't Need
  • 21.8 Performance Tip: Use Built-In Functions When Possible
  • 21.9 Performance Tip: Avoid Superfluous Function Calls
  • 21.10 Performance Tip: Avoid Creating Large New Objects
  • 21.11 Further Reading
  • 21.12 Glossary
  • Part III Specialized or Advanced Topics
  • Chapter 22 Computer Memory and Data Structures
  • 22.1 Virtual Memory, the Stack, and the Heap
  • 22.2 Example C Program
  • 22.3 Data Types and Arrays in Memory
  • 22.4 Structs
  • 22.5 Pointers, the Stack, and the Heap
  • 22.6 Key Data Structures
  • 22.7 Further Reading
  • 22.8 Glossary
  • Chapter 23 Maximum Likelihood Estimation and Optimization
  • 23.1 Maximum Likelihood Estimation
  • 23.2 A Simple Example: Fitting a Line
  • 23.3 Another Example: Logistic Regression
  • 23.4 Optimization
  • 23.5 Gradient Descent and Convex Optimization
  • 23.6 Convex Optimization
  • 23.7 Stochastic Gradient Descent
  • 23.8 Further Reading
  • 23.9 Glossary
  • Chapter 24 Advanced Classifiers
  • 24.1 A Note on Libraries
  • 24.2 Basic Deep Learning
  • 24.3 Convolutional Neural Networks
  • 24.4 Different Types of Layers. What the Heck Is a Tensor?
  • 24.5 Example: The MNIST Handwriting Dataset
  • 24.6 Recurrent Neural Networks
  • 24.7 Bayesian Networks
  • 24.8 Training and Prediction
  • 24.9 Markov Chain Monte Carlo
  • 24.10 PyMC Example
  • 24.11 Further Reading
  • 24.12 Glossary
  • Chapter 25 Stochastic Modeling
  • 25.1 Markov Chains
  • 25.2 Two Kinds of Markov Chain, Two Kinds of Questions
  • 25.3 Markov Chain Monte Carlo
  • 25.4 Hidden Markov Models and the Viterbi Algorithm
  • 25.5 The Viterbi Algorithm
  • 25.6 Random Walks
  • 25.7 Brownian Motion
  • 25.8 ARIMA Models