PySpark Cookbook: Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python
Combine the power of Apache Spark and Python to build effective big data applications.

About This Book:
- Perform effective data processing, machine learning, and analytics using PySpark
- Overcome challenges in developing and deploying Spark solutions using Python
- Explore recipes for efficiently combinin...
Format: eBook
Language: English
Published: Birmingham, UK ; Mumbai : Packt, 2018.
Edition: 1st edition
See on Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630446806719
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Packt Upsell
- Contributors
- Table of Contents
- Preface
- Chapter 1: Installing and Configuring Spark
- Introduction
- Installing Spark requirements
- Getting ready
- How to do it...
- How it works...
- There's more...
- Installing Java
- Installing Python
- Installing R
- Installing Scala
- Installing Maven
- Updating PATH
- Installing Spark from sources
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Installing Spark from binaries
- Getting ready
- How to do it...
- How it works...
- There's more...
- Configuring a local instance of Spark
- Getting ready
- How to do it...
- How it works...
- See also
- Configuring a multi-node instance of Spark
- Getting ready
- How to do it...
- How it works...
- See also
- Installing Jupyter
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Configuring a session in Jupyter
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Working with Cloudera Spark images
- Getting ready
- How to do it...
- How it works...
- Chapter 2: Abstracting Data with RDDs
- Introduction
- Creating RDDs
- Getting ready
- How to do it...
- How it works...
- Spark context parallelize method
- .take(...) method
- Reading data from files
- Getting ready
- How to do it...
- How it works...
- .textFile(...) method
- .map(...) method
- Partitions and performance
- Overview of RDD transformations
- Getting ready
- How to do it...
- .map(...) transformation
- .filter(...) transformation
- .flatMap(...) transformation
- .distinct() transformation
- .sample(...) transformation
- .join(...) transformation
- .repartition(...) transformation
- .zipWithIndex() transformation
- .reduceByKey(...) transformation
- .sortByKey(...) transformation
- .union(...) transformation
- .mapPartitionsWithIndex(...) transformation
- How it works...
- Overview of RDD actions
- Getting ready
- How to do it...
- .take(...) action
- .collect() action
- .reduce(...) action
- .count() action
- .saveAsTextFile(...) action
- How it works...
- Pitfalls of using RDDs
- Getting ready
- How to do it...
- How it works...
- Chapter 3: Abstracting Data with DataFrames
- Introduction
- Creating DataFrames
- Getting ready
- How to do it...
- How it works...
- There's more...
- From JSON
- From CSV
- See also
- Accessing underlying RDDs
- Getting ready
- How to do it...
- How it works...
- Performance optimizations
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Inferring the schema using reflection
- Getting ready
- How to do it...
- How it works...
- See also
- Specifying the schema programmatically
- Getting ready
- How to do it...
- How it works...
- See also
- Creating a temporary table
- Getting ready
- How to do it...
- How it works...
- There's more...
- Using SQL to interact with DataFrames
- Getting ready
- How to do it...
- How it works...
- There's more...
- Overview of DataFrame transformations
- Getting ready
- How to do it...
- The .select(...) transformation
- The .filter(...) transformation
- The .groupBy(...) transformation
- The .orderBy(...) transformation
- The .withColumn(...) transformation
- The .join(...) transformation
- The .unionAll(...) transformation
- The .distinct(...) transformation
- The .repartition(...) transformation
- The .fillna(...) transformation
- The .dropna(...) transformation
- The .dropDuplicates(...) transformation
- The .summary() and .describe() transformations
- The .freqItems(...) transformation
- See also
- Overview of DataFrame actions
- Getting ready
- How to do it...
- The .show(...) action
- The .collect() action
- The .take(...) action
- The .toPandas() action
- See also
- Chapter 4: Preparing Data for Modeling
- Introduction
- Handling duplicates
- Getting ready
- How to do it...
- How it works...
- There's more...
- Only IDs differ
- ID collisions
- Handling missing observations
- Getting ready
- How to do it...
- How it works...
- Missing observations per row
- Missing observations per column
- There's more...
- See also
- Handling outliers
- Getting ready
- How to do it...
- How it works...
- See also
- Exploring descriptive statistics
- Getting ready
- How to do it...
- How it works...
- There's more...
- Descriptive statistics for aggregated columns
- See also
- Computing correlations
- Getting ready
- How to do it...
- How it works...
- There's more...
- Drawing histograms
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Visualizing interactions between features
- Getting ready
- How to do it...
- How it works...
- There's more...
- Chapter 5: Machine Learning with MLlib
- Loading the data
- Getting ready
- How to do it...
- How it works...
- There's more...
- Exploring the data
- Getting ready
- How to do it...
- How it works...
- Numerical features
- Categorical features
- There's more...
- See also
- Testing the data
- Getting ready
- How to do it...
- How it works...
- See also
- Transforming the data
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Standardizing the data
- Getting ready
- How to do it...
- How it works...
- Creating an RDD for training
- Getting ready
- How to do it...
- Classification
- Regression
- How it works...
- There's more...
- See also
- Predicting hours of work for census respondents
- Getting ready
- How to do it...
- How it works...
- Forecasting the income levels of census respondents
- Getting ready
- How to do it...
- How it works...
- There's more...
- Building a clustering model
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Computing performance statistics
- Getting ready
- How to do it...
- How it works...
- Regression metrics
- Classification metrics
- See also
- Chapter 6: Machine Learning with the ML Module
- Introducing Transformers
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Introducing Estimators
- Getting ready
- How to do it...
- How it works...
- There's more...
- Introducing Pipelines
- Getting ready
- How to do it...
- How it works...
- See also
- Selecting the most predictable features
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Predicting forest coverage types
- Getting ready
- How to do it...
- How it works...
- There's more...
- Estimating forest elevation
- Getting ready
- How to do it...
- How it works...
- There's more...
- Clustering forest cover types
- Getting ready
- How to do it...
- How it works...
- See also
- Tuning hyperparameters
- Getting ready
- How to do it...
- How it works...
- There's more...
- Extracting features from text
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Discretizing continuous variables
- Getting ready
- How to do it...
- How it works...
- Standardizing continuous variables
- Getting ready
- How to do it...
- How it works...
- Topic mining
- Getting ready
- How to do it...
- How it works...
- Chapter 7: Structured Streaming with PySpark
- Introduction
- Understanding Spark Streaming
- Understanding DStreams
- Getting ready
- How to do it...
- Terminal 1 - Netcat window
- Terminal 2 - Spark Streaming window
- How it works...
- There's more...
- Understanding global aggregations
- Getting ready
- How to do it...
- Terminal 1 - Netcat window
- Terminal 2 - Spark Streaming window
- How it works...
- Continuous aggregation with structured streaming
- Getting ready
- How to do it...
- Terminal 1 - Netcat window
- Terminal 2 - Spark Streaming window
- How it works...
- Chapter 8: GraphFrames - Graph Theory with PySpark
- Introduction
- Installing GraphFrames
- Getting ready
- How to do it...
- How it works...
- Preparing the data
- Getting ready
- How to do it...
- How it works...
- There's more...
- Building the graph
- How to do it...
- How it works...
- Running queries against the graph
- Getting ready
- How to do it...
- How it works...
- Understanding the graph
- Getting ready
- How to do it...
- How it works...
- Using PageRank to determine airport ranking
- Getting ready
- How to do it...
- How it works...
- Finding the fewest number of connections
- Getting ready
- How to do it...
- How it works...
- There's more...
- See also
- Visualizing the graph
- Getting ready
- How to do it...
- How it works...
- Index