PySpark Cookbook: Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python
Combine the power of Apache Spark and Python to build effective big data applications.

About This Book:
- Perform effective data processing, machine learning, and analytics using PySpark
- Overcome challenges in developing and deploying Spark solutions using Python
- Explore recipes for efficiently combinin...
Format: eBook
Language: English
Published: Birmingham, UK ; Mumbai : Packt, 2018.
Edition: 1st edition
See on Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630446806719
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Packt Upsell
- Contributors
- Table of Contents
- Preface
- Chapter 1: Installing and Configuring Spark
- Introduction
- Installing Spark requirements
- Getting ready
- How to do it...
- How it works...
- There's more...
- Installing Java
- Installing Python
- Installing R
- Installing Scala
- Installing Maven
- Updating PATH
- Installing Spark from sources
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Installing Spark from binaries
- Getting ready
- How to do it...
- How it works...
- There's more...
- Configuring a local instance of Spark
- Getting ready
- How to do it...
- How it works...
- See also
- Configuring a multi-node instance of Spark
- Getting ready
- How to do it...
- How it works...
- See also
- Installing Jupyter
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Configuring a session in Jupyter
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Working with Cloudera Spark images
- Getting ready
- How to do it...
- How it works...
- Chapter 2: Abstracting Data with RDDs
- Introduction
- Creating RDDs
- Getting ready
- How to do it...
- How it works...
- Spark context parallelize method
- .take(...) method
- Reading data from files
- Getting ready
- How to do it...
- How it works...
- .textFile(...) method
- .map(...) method
- Partitions and performance
- Overview of RDD transformations
- Getting ready
- How to do it...
- .map(...) transformation
- .filter(...) transformation
- .flatMap(...) transformation
- .distinct() transformation
- .sample(...) transformation
- .join(...) transformation
- .repartition(...) transformation
- .zipWithIndex() transformation
- .reduceByKey(...) transformation
- .sortByKey(...) transformation
- .union(...) transformation
- .mapPartitionsWithIndex(...) transformation
- How it works...
- Overview of RDD actions
- Getting ready
- How to do it...
- .take(...) action
- .collect() action
- .reduce(...) action
- .count() action
- .saveAsTextFile(...) action
- How it works...
- Pitfalls of using RDDs
- Getting ready
- How to do it...
- How it works...
- Chapter 3: Abstracting Data with DataFrames
- Introduction
- Creating DataFrames
- Getting ready
- How to do it...
- How it works...
- There's more...
- From JSON
- From CSV
- See also
- Accessing underlying RDDs
- Getting ready
- How to do it...
- How it works...
- Performance optimizations
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Inferring the schema using reflection
- Getting ready
- How to do it...
- How it works...
- See also
- Specifying the schema programmatically
- Getting ready
- How to do it...
- How it works...
- See also
- Creating a temporary table
- Getting ready
- How to do it...
- How it works...
- There's more...
- Using SQL to interact with DataFrames
- Getting ready
- How to do it...
- How it works...
- There's more...
- Overview of DataFrame transformations
- Getting ready
- How to do it...
- The .select(...) transformation
- The .filter(...) transformation
- The .groupBy(...) transformation
- The .orderBy(...) transformation
- The .withColumn(...) transformation
- The .join(...) transformation
- The .unionAll(...) transformation
- The .distinct(...) transformation
- The .repartition(...) transformation
- The .fillna(...) transformation
- The .dropna(...) transformation
- The .dropDuplicates(...) transformation
- The .summary() and .describe() transformations
- The .freqItems(...) transformation
- See also
- Overview of DataFrame actions
- Getting ready
- How to do it...
- The .show(...) action
- The .collect() action
- The .take(...) action
- The .toPandas() action
- See also
- Chapter 4: Preparing Data for Modeling
- Introduction
- Handling duplicates
- Getting ready
- How to do it...
- How it works...
- There's more...
- Only IDs differ
- ID collisions
- Handling missing observations
- Getting ready
- How to do it...
- How it works...
- Missing observations per row
- Missing observations per column
- There's more...
- See also
- Handling outliers
- Getting ready
- How to do it...
- How it works...
- See also
- Exploring descriptive statistics
- Getting ready
- How to do it...
- How it works...
- There's more...
- Descriptive statistics for aggregated columns
- See also
- Computing correlations
- Getting ready
- How to do it...
- How it works...
- There's more...
- Drawing histograms
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Visualizing interactions between features
- Getting ready
- How to do it...
- How it works...
- There's more...
- Chapter 5: Machine Learning with MLlib
- Loading the data
- Getting ready
- How to do it...
- How it works...
- There's more...
- Exploring the data
- Getting ready
- How to do it...
- How it works...
- Numerical features
- Categorical features
- There's more...
- See also
- Testing the data
- Getting ready
- How to do it...
- How it works...
- See also
- Transforming the data
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Standardizing the data
- Getting ready
- How to do it...
- How it works...
- Creating an RDD for training
- Getting ready
- How to do it...
- Classification
- Regression
- How it works...
- There's more...
- See also
- Predicting hours of work for census respondents
- Getting ready
- How to do it...
- How it works...
- Forecasting the income levels of census respondents
- Getting ready
- How to do it...
- How it works...
- There's more...
- Building a clustering model
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Computing performance statistics
- Getting ready
- How to do it...
- How it works...
- Regression metrics
- Classification metrics
- See also
- Chapter 6: Machine Learning with the ML Module
- Introducing Transformers
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Introducing Estimators
- Getting ready
- How to do it...
- How it works...
- There's more...
- Introducing Pipelines
- Getting ready
- How to do it...
- How it works...
- See also
- Selecting the most predictable features
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Predicting forest coverage types
- Getting ready
- How to do it...
- How it works...
- There's more...
- Estimating forest elevation
- Getting ready
- How to do it...
- How it works...
- There's more...
- Clustering forest cover types
- Getting ready
- How to do it...
- How it works...
- See also
- Tuning hyperparameters
- Getting ready
- How to do it...
- How it works...
- There's more...
- Extracting features from text
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Discretizing continuous variables
- Getting ready
- How to do it...
- How it works...
- Standardizing continuous variables
- Getting ready
- How to do it...
- How it works...
- Topic mining
- Getting ready
- How to do it...
- How it works...
- Chapter 7: Structured Streaming with PySpark
- Introduction
- Understanding Spark Streaming
- Understanding DStreams
- Getting ready
- How to do it...
- Terminal 1 - Netcat window
- Terminal 2 - Spark Streaming window
- How it works...
- There's more...
- Understanding global aggregations
- Getting ready
- How to do it...
- Terminal 1 - Netcat window
- Terminal 2 - Spark Streaming window
- How it works...
- Continuous aggregation with structured streaming
- Getting ready
- How to do it...
- Terminal 1 - Netcat window
- Terminal 2 - Spark Streaming window
- How it works...
- Chapter 8: GraphFrames - Graph Theory with PySpark
- Introduction
- Installing GraphFrames
- Getting ready
- How to do it...
- How it works...
- Preparing the data
- Getting ready
- How to do it...
- How it works...
- There's more...
- Building the graph
- How to do it...
- How it works...
- Running queries against the graph
- Getting ready
- How to do it...
- How it works...
- Understanding the graph
- Getting ready
- How to do it...
- How it works...
- Using PageRank to determine airport ranking
- Getting ready
- How to do it...
- How it works...
- Finding the fewest number of connections
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Visualizing the graph
- Getting ready
- How to do it...
- How it works...
- Index