Hands-On Big Data Analytics with PySpark: Analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs
Use PySpark to easily crunch messy data at scale and discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs.
Key Features: work with large amounts of agile data using distributed datasets and in-memory caching; source data from all popular data hosting platforms, ...
Other Authors:
Format: e-Book
Language: English
Published: Birmingham; Mumbai: Packt Publishing, 2019
Edition: 1st edition
Subjects:
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630455606719
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- About Packt
- Contributors
- Table of Contents
- Preface
- Chapter 1: PySpark and Setting Up Your Development Environment
- An overview of PySpark
- Spark SQL
- Setting up Spark on Windows and PySpark
- Core concepts in Spark and PySpark
- SparkContext
- Spark shell
- SparkConf
- Summary
- Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs
- Loading data on to Spark RDDs
- The UCI machine learning repository
- Getting the data from the repository to Spark
- Getting data into Spark
- Parallelization with Spark RDDs
- What is parallelization?
- Basics of RDD operation
- Summary
- Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks
- Using Spark Notebooks for quick iteration of ideas
- Sampling/filtering RDDs to pick out relevant data points
- Splitting datasets and creating some new combinations
- Summary
- Chapter 4: Aggregating and Summarizing Data into Useful Reports
- Calculating averages with map and reduce
- Faster average computations with aggregate
- Pivot tabling with key-value paired data points
- Summary
- Chapter 5: Powerful Exploratory Data Analysis with MLlib
- Computing summary statistics with MLlib
- Using Pearson and Spearman correlations to discover correlations
- The Pearson correlation
- The Spearman correlation
- Computing Pearson and Spearman correlations
- Testing our hypotheses on large datasets
- Summary
- Chapter 6: Putting Structure on Your Big Data with SparkSQL
- Manipulating DataFrames with Spark SQL schemas
- Using Spark DSL to build queries
- Summary
- Chapter 7: Transformations and Actions
- Using Spark transformations to defer computations to a later time
- Avoiding transformations
- Using the reduce and reduceByKey methods to calculate the results
- Performing actions that trigger computations
- Reusing the same RDD for different actions
- Summary
- Chapter 8: Immutable Design
- Delving into the Spark RDD's parent/child chain
- Extending an RDD
- Chaining a new RDD with the parent
- Testing our custom RDD
- Using RDD in an immutable way
- Using DataFrame operations to transform
- Immutability in the highly concurrent environment
- Using the Dataset API in an immutable way
- Summary
- Chapter 9: Avoiding Shuffle and Reducing Operational Expenses
- Detecting a shuffle in a process
- Testing operations that cause a shuffle in Apache Spark
- Changing the design of jobs with wide dependencies
- Using keyBy() operations to reduce shuffle
- Using a custom partitioner to reduce shuffle
- Summary
- Chapter 10: Saving Data in the Correct Format
- Saving data in plain text format
- Leveraging JSON as a data format
- Tabular formats - CSV
- Using Avro with Spark
- Columnar formats - Parquet
- Summary
- Chapter 11: Working with the Spark Key/Value API
- Available actions on key/value pairs
- Using aggregateByKey instead of groupBy()
- Actions on key/value pairs
- Available partitioners on key/value data
- Implementing a custom partitioner
- Summary
- Chapter 12: Testing Apache Spark Jobs
- Separating logic from Spark engine-unit testing
- Integration testing using SparkSession
- Mocking data sources using partial functions
- Using ScalaCheck for property-based testing
- Testing in different versions of Spark
- Summary
- Chapter 13: Leveraging the Spark GraphX API
- Creating a graph from a data source
- Creating the loader component
- Revisiting the graph format
- Loading Spark from file
- Using the Vertex API
- Constructing a graph using the vertex
- Creating couple relationships
- Using the Edge API
- Constructing the graph using edge
- Calculating the degree of the vertex
- The in-degree
- The out-degree
- Calculating PageRank
- Loading and reloading data about users and followers
- Summary
- Other Books You May Enjoy
- Index