The data science handbook

Bibliographic Details
Other Authors: Cady, Field, 1984- (author)
Format: Electronic book
Language: English
Published: Hoboken, New Jersey: John Wiley & Sons, Incorporated, 2017
Edition: 1st ed.
Series: THEi Wiley ebooks.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009849125506719
Table of Contents:
  • Cover
  • Title Page
  • Copyright
  • Dedication
  • Contents
  • Preface
  • Chapter 1 Introduction: Becoming a Unicorn
  • 1.1 Aren't Data Scientists Just Overpaid Statisticians?
  • 1.2 How Is This Book Organized?
  • 1.3 How to Use This Book
  • 1.4 Why Is It All in Python™, Anyway?
  • 1.5 Example Code and Datasets
  • 1.6 Parting Words
  • Part I The Stuff You'll Always Use
  • Chapter 2 The Data Science Road Map
  • 2.1 Frame the Problem
  • 2.2 Understand the Data: Basic Questions
  • 2.3 Understand the Data: Data Wrangling
  • 2.4 Understand the Data: Exploratory Analysis
  • 2.5 Extract Features
  • 2.6 Model
  • 2.7 Present Results
  • 2.8 Deploy Code
  • 2.9 Iterating
  • 2.10 Glossary
  • Chapter 3 Programming Languages
  • 3.1 Why Use a Programming Language? What Are the Other Options?
  • 3.2 A Survey of Programming Languages for Data Science
  • 3.3 Python Crash Course
  • 3.4 Strings
  • 3.5 Defining Functions
  • 3.6 Python's Technical Libraries
  • 3.7 Other Python Resources
  • 3.8 Further Reading
  • 3.9 Glossary
  • Interlude: My Personal Toolkit
  • Chapter 4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
  • 4.1 The Worst Dataset in the World
  • 4.2 How to Identify Pathologies
  • 4.3 Problems with Data Content
  • 4.4 Formatting Issues
  • 4.5 Example Formatting Script
  • 4.6 Regular Expressions
  • 4.7 Life in the Trenches
  • 4.8 Glossary
  • Chapter 5 Visualizations and Simple Metrics
  • 5.1 A Note on Python's Visualization Tools
  • 5.2 Example Code
  • 5.3 Pie Charts
  • 5.4 Bar Charts
  • 5.5 Histograms
  • 5.6 Means, Standard Deviations, Medians, and Quantiles
  • 5.7 Boxplots
  • 5.8 Scatterplots
  • 5.9 Scatterplots with Logarithmic Axes
  • 5.10 Scatter Matrices
  • 5.11 Heatmaps
  • 5.12 Correlations
  • 5.13 Anscombe's Quartet and the Limits of Numbers
  • 5.14 Time Series
  • 5.15 Further Reading
  • 5.16 Glossary
  • Chapter 6 Machine Learning Overview
  • 6.1 Historical Context
  • 6.2 Supervised versus Unsupervised
  • 6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting
  • 6.4 Further Reading
  • 6.5 Glossary
  • Chapter 7 Interlude: Feature Extraction Ideas
  • 7.1 Standard Features
  • 7.2 Features That Involve Grouping
  • 7.3 Preview of More Sophisticated Features
  • 7.4 Defining the Feature You Want to Predict
  • Chapter 8 Machine Learning Classification
  • 8.1 What Is a Classifier, and What Can You Do with It?
  • 8.2 A Few Practical Concerns
  • 8.3 Binary versus Multiclass
  • 8.4 Example Script
  • 8.5 Specific Classifiers
  • 8.6 Evaluating Classifiers
  • 8.7 Selecting Classification Cutoffs
  • 8.8 Further Reading
  • 8.9 Glossary
  • Chapter 9 Technical Communication and Documentation
  • 9.1 Several Guiding Principles
  • 9.2 Slide Decks
  • 9.3 Written Reports
  • 9.4 Speaking: What Has Worked for Me
  • 9.5 Code Documentation
  • 9.6 Further Reading
  • 9.7 Glossary
  • Part II Stuff You Still Need to Know
  • Chapter 10 Unsupervised Learning: Clustering and Dimensionality Reduction
  • 10.1 The Curse of Dimensionality
  • 10.2 Example: Eigenfaces for Dimensionality Reduction
  • 10.3 Principal Component Analysis and Factor Analysis
  • 10.4 Scree Plots and Understanding Dimensionality
  • 10.5 Factor Analysis
  • 10.6 Limitations of PCA
  • 10.7 Clustering
  • 10.8 Further Reading
  • 10.9 Glossary
  • Chapter 11 Regression
  • 11.1 Example: Predicting Diabetes Progression
  • 11.2 Least Squares
  • 11.3 Fitting Nonlinear Curves
  • 11.4 Goodness of Fit: R² and Correlation
  • 11.5 Correlation of Residuals
  • 11.6 Linear Regression
  • 11.7 LASSO Regression and Feature Selection
  • 11.8 Further Reading
  • 11.9 Glossary
  • Chapter 12 Data Encodings and File Formats
  • 12.1 Typical File Format Categories
  • 12.2 CSV Files
  • 12.3 JSON Files
  • 12.4 XML Files
  • 12.5 HTML Files
  • 12.6 Tar Files
  • 12.7 GZip Files
  • 12.8 Zip Files
  • 12.9 Image Files: Rasterized, Vectorized, and/or Compressed
  • 12.10 It's All Bytes at the End of the Day
  • 12.11 Integers
  • 12.12 Floats
  • 12.13 Text Data
  • 12.14 Further Reading
  • 12.15 Glossary
  • Chapter 13 Big Data
  • 13.1 What Is Big Data?
  • 13.2 Hadoop: The File System and the Processor
  • 13.3 Using HDFS
  • 13.4 Example PySpark Script
  • 13.5 Spark Overview
  • 13.6 Spark Operations
  • 13.7 Two Ways to Run PySpark
  • 13.8 Configuring Spark
  • 13.9 Under the Hood
  • 13.10 Spark Tips and Gotchas
  • 13.11 The MapReduce Paradigm
  • 13.12 Performance Considerations
  • 13.13 Further Reading
  • 13.14 Glossary
  • Chapter 14 Databases
  • 14.1 Relational Databases and MySQL®
  • 14.2 Key-Value Stores
  • 14.3 Wide Column Stores
  • 14.4 Document Stores
  • 14.5 Further Reading
  • 14.6 Glossary
  • Chapter 15 Software Engineering Best Practices
  • 15.1 Coding Style
  • 15.2 Version Control and Git for Data Scientists
  • 15.3 Testing Code
  • 15.4 Test-Driven Development
  • 15.5 AGILE Methodology
  • 15.6 Further Reading
  • 15.7 Glossary
  • Chapter 16 Natural Language Processing
  • 16.1 Do I Even Need NLP?
  • 16.2 The Great Divide: Language versus Statistics
  • 16.3 Example: Sentiment Analysis on Stock Market Articles
  • 16.4 Software and Datasets
  • 16.5 Tokenization
  • 16.6 Central Concept: Bag-of-Words
  • 16.7 Word Weighting: TF-IDF
  • 16.8 n-Grams
  • 16.9 Stop Words
  • 16.10 Lemmatization and Stemming
  • 16.11 Synonyms
  • 16.12 Part of Speech Tagging
  • 16.13 Common Problems
  • 16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
  • 16.15 Further Reading
  • 16.16 Glossary
  • Chapter 17 Time Series Analysis
  • 17.1 Example: Predicting Wikipedia Page Views
  • 17.2 A Typical Workflow
  • 17.3 Time Series versus Time-Stamped Events
  • 17.4 Resampling and Interpolation
  • 17.5 Smoothing Signals
  • 17.6 Logarithms and Other Transformations
  • 17.7 Trends and Periodicity
  • 17.8 Windowing
  • 17.9 Brainstorming Simple Features
  • 17.10 Better Features: Time Series as Vectors
  • 17.11 Fourier Analysis: Sometimes a Magic Bullet
  • 17.12 Time Series in Context: The Whole Suite of Features
  • 17.13 Further Reading
  • 17.14 Glossary
  • Chapter 18 Probability
  • 18.1 Flipping Coins: Bernoulli Random Variables
  • 18.2 Throwing Darts: Uniform Random Variables
  • 18.3 The Uniform Distribution and Pseudorandom Numbers
  • 18.4 Nondiscrete, Noncontinuous Random Variables
  • 18.5 Notation, Expectations, and Standard Deviation
  • 18.6 Dependence, Marginal and Conditional Probability
  • 18.7 Understanding the Tails
  • 18.8 Binomial Distribution
  • 18.9 Poisson Distribution
  • 18.10 Normal Distribution
  • 18.11 Multivariate Gaussian
  • 18.12 Exponential Distribution
  • 18.13 Log-Normal Distribution
  • 18.14 Entropy
  • 18.15 Further Reading
  • 18.16 Glossary
  • Chapter 19 Statistics
  • 19.1 Statistics in Perspective
  • 19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
  • 19.3 Hypothesis Testing: Key Idea and Example
  • 19.4 Multiple Hypothesis Testing
  • 19.5 Parameter Estimation
  • 19.6 Hypothesis Testing: t-Test
  • 19.7 Confidence Intervals
  • 19.8 Bayesian Statistics
  • 19.9 Naive Bayesian Statistics
  • 19.10 Bayesian Networks
  • 19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
  • 19.12 Further Reading
  • 19.13 Glossary
  • Chapter 20 Programming Language Concepts
  • 20.1 Programming Paradigms
  • 20.2 Compilation and Interpretation
  • 20.3 Type Systems
  • 20.4 Further Reading
  • 20.5 Glossary
  • Chapter 21 Performance and Computer Memory
  • 21.1 Example Script
  • 21.2 Algorithm Performance and Big-O Notation
  • 21.3 Some Classic Problems: Sorting a List and Binary Search
  • 21.4 Amortized Performance and Average Performance
  • 21.5 Two Principles: Reducing Overhead and Managing Memory
  • 21.6 Performance Tip: Use Numerical Libraries When Applicable
  • 21.7 Performance Tip: Delete Large Structures You Don't Need
  • 21.8 Performance Tip: Use Built-In Functions When Possible
  • 21.9 Performance Tip: Avoid Superfluous Function Calls
  • 21.10 Performance Tip: Avoid Creating Large New Objects
  • 21.11 Further Reading
  • 21.12 Glossary
  • Part III Specialized or Advanced Topics
  • Chapter 22 Computer Memory and Data Structures
  • 22.1 Virtual Memory, the Stack, and the Heap
  • 22.2 Example C Program
  • 22.3 Data Types and Arrays in Memory
  • 22.4 Structs
  • 22.5 Pointers, the Stack, and the Heap
  • 22.6 Key Data Structures
  • 22.7 Further Reading
  • 22.8 Glossary
  • Chapter 23 Maximum Likelihood Estimation and Optimization
  • 23.1 Maximum Likelihood Estimation
  • 23.2 A Simple Example: Fitting a Line
  • 23.3 Another Example: Logistic Regression
  • 23.4 Optimization
  • 23.5 Gradient Descent and Convex Optimization
  • 23.6 Convex Optimization
  • 23.7 Stochastic Gradient Descent
  • 23.8 Further Reading
  • 23.9 Glossary
  • Chapter 24 Advanced Classifiers
  • 24.1 A Note on Libraries
  • 24.2 Basic Deep Learning
  • 24.3 Convolutional Neural Networks
  • 24.4 Different Types of Layers. What the Heck Is a Tensor?
  • 24.5 Example: The MNIST Handwriting Dataset
  • 24.6 Recurrent Neural Networks
  • 24.7 Bayesian Networks
  • 24.8 Training and Prediction
  • 24.9 Markov Chain Monte Carlo
  • 24.10 PyMC Example
  • 24.11 Further Reading
  • 24.12 Glossary
  • Chapter 25 Stochastic Modeling
  • 25.1 Markov Chains
  • 25.2 Two Kinds of Markov Chain, Two Kinds of Questions
  • 25.3 Markov Chain Monte Carlo
  • 25.4 Hidden Markov Models and the Viterbi Algorithm
  • 25.5 The Viterbi Algorithm
  • 25.6 Random Walks
  • 25.7 Brownian Motion
  • 25.8 ARIMA Models