Hands-On Big Data Analytics with PySpark: analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs

Use PySpark to easily crush messy data at scale and discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs. Key Features: work with large amounts of agile data using distributed datasets and in-memory caching; source data from all popular data hosting platforms,...

Full description

Bibliographic Details
Other Authors: Lai, Rudy, author; Potaczek, Bartłomiej, author
Format: Electronic book
Language: English
Published: Birmingham ; Mumbai : Packt Publishing, 2019.
Edition: 1st edition
Subjects:
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630455606719
Table of Contents:
  • Cover
  • Title Page
  • Copyright and Credits
  • About Packt
  • Contributors
  • Table of Contents
  • Preface
  • Chapter 1: PySpark and Setting Up Your Development Environment
  • An overview of PySpark
  • Spark SQL
  • Setting up Spark on Windows and PySpark
  • Core concepts in Spark and PySpark
  • SparkContext
  • Spark shell
  • SparkConf
  • Summary
  • Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs
  • Loading data onto Spark RDDs
  • The UCI machine learning repository
  • Getting the data from the repository to Spark
  • Getting data into Spark
  • Parallelization with Spark RDDs
  • What is parallelization?
  • Basics of RDD operation
  • Summary
  • Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks
  • Using Spark Notebooks for quick iteration of ideas
  • Sampling/filtering RDDs to pick out relevant data points
  • Splitting datasets and creating some new combinations
  • Summary
  • Chapter 4: Aggregating and Summarizing Data into Useful Reports
  • Calculating averages with map and reduce
  • Faster average computations with aggregate
  • Pivot tabling with key-value paired data points
  • Summary
  • Chapter 5: Powerful Exploratory Data Analysis with MLlib
  • Computing summary statistics with MLlib
  • Using Pearson and Spearman correlations to discover correlations
  • The Pearson correlation
  • The Spearman correlation
  • Computing Pearson and Spearman correlations
  • Testing our hypotheses on large datasets
  • Summary
  • Chapter 6: Putting Structure on Your Big Data with SparkSQL
  • Manipulating DataFrames with Spark SQL schemas
  • Using Spark DSL to build queries
  • Summary
  • Chapter 7: Transformations and Actions
  • Using Spark transformations to defer computations to a later time
  • Avoiding transformations
  • Using the reduce and reduceByKey methods to calculate the results
  • Performing actions that trigger computations
  • Reusing the same RDD for different actions
  • Summary
  • Chapter 8: Immutable Design
  • Delving into the Spark RDD's parent/child chain
  • Extending an RDD
  • Chaining a new RDD with the parent
  • Testing our custom RDD
  • Using RDD in an immutable way
  • Using DataFrame operations to transform
  • Immutability in the highly concurrent environment
  • Using the Dataset API in an immutable way
  • Summary
  • Chapter 9: Avoiding Shuffle and Reducing Operational Expenses
  • Detecting a shuffle in a process
  • Testing operations that cause a shuffle in Apache Spark
  • Changing the design of jobs with wide dependencies
  • Using keyBy() operations to reduce shuffle
  • Using a custom partitioner to reduce shuffle
  • Summary
  • Chapter 10: Saving Data in the Correct Format
  • Saving data in plain text format
  • Leveraging JSON as a data format
  • Tabular formats - CSV
  • Using Avro with Spark
  • Columnar formats - Parquet
  • Summary
  • Chapter 11: Working with the Spark Key/Value API
  • Available actions on key/value pairs
  • Using aggregateByKey instead of groupBy()
  • Actions on key/value pairs
  • Available partitioners on key/value data
  • Implementing a custom partitioner
  • Summary
  • Chapter 12: Testing Apache Spark Jobs
  • Separating logic from Spark engine - unit testing
  • Integration testing using SparkSession
  • Mocking data sources using partial functions
  • Using ScalaCheck for property-based testing
  • Testing in different versions of Spark
  • Summary
  • Chapter 13: Leveraging the Spark GraphX API
  • Creating a graph from a data source
  • Creating the loader component
  • Revisiting the graph format
  • Loading Spark from file
  • Using the Vertex API
  • Constructing a graph using the vertex
  • Creating couple relationships
  • Using the Edge API
  • Constructing the graph using edge
  • Calculating the degree of the vertex
  • The in-degree
  • The out-degree
  • Calculating PageRank
  • Loading and reloading data about users and followers
  • Summary
  • Other Books You May Enjoy
  • Index
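
To give a flavor of the techniques listed in the contents above (parallelization in Chapter 2, map/reduce averaging in Chapter 4, and reduceByKey in Chapters 7 and 11), here is a minimal PySpark sketch; the sample values, application name, and local[*] master are invented for illustration and are not taken from the book.

    from pyspark.sql import SparkSession

    # Hypothetical local session; the app name and master are illustrative only.
    spark = SparkSession.builder.master("local[*]").appName("catalog-sketch").getOrCreate()
    sc = spark.sparkContext

    # Chapter 2 style: parallelize a small in-memory collection into an RDD.
    prices = sc.parallelize([("fruit", 3.0), ("fruit", 5.0), ("veg", 2.0), ("veg", 4.0)])

    # Chapter 4 style: overall average computed with map and reduce.
    total, count = prices.map(lambda kv: (kv[1], 1)).reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))
    print("overall average:", total / count)

    # Chapter 7/11 style: per-key sums with reduceByKey, which combines values
    # map-side before the shuffle, unlike groupByKey.
    print("per-key totals:", prices.reduceByKey(lambda a, b: a + b).collect())

    spark.stop()

The later sections listed above (aggregateByKey, custom partitioners, saving to Parquet) would slot into the same session-based pattern.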