Hands-On Big Data Analytics with PySpark: analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs

Use PySpark to easily crush messy data at scale and discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs. Key Features: work with large amounts of agile data using distributed datasets and in-memory caching; source data from all popular data hosting platforms,...

Full description

Bibliographic Details
Other Authors: Lai, Rudy, author; Potaczek, Bartłomiej, author
Format: Electronic book
Language: English
Published: Birmingham ; Mumbai : Packt Publishing, 2019.
Edition: 1st edition
Subjects:
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630455606719
Table of Contents:
  • Cover
  • Title Page
  • Copyright and Credits
  • About Packt
  • Contributors
  • Table of Contents
  • Preface
  • Chapter 1: PySpark and Setting Up Your Development Environment
  • An overview of PySpark
  • Spark SQL
  • Setting up Spark on Windows and PySpark
  • Core concepts in Spark and PySpark
  • SparkContext
  • Spark shell
  • SparkConf
  • Summary
  • Chapter 2: Getting Your Big Data into the Spark Environment Using RDDs
  • Loading data onto Spark RDDs
  • The UCI machine learning repository
  • Getting the data from the repository to Spark
  • Getting data into Spark
  • Parallelization with Spark RDDs
  • What is parallelization?
  • Basics of RDD operation
  • Summary
  • Chapter 3: Big Data Cleaning and Wrangling with Spark Notebooks
  • Using Spark Notebooks for quick iteration of ideas
  • Sampling/filtering RDDs to pick out relevant data points
  • Splitting datasets and creating some new combinations
  • Summary
  • Chapter 4: Aggregating and Summarizing Data into Useful Reports
  • Calculating averages with map and reduce
  • Faster average computations with aggregate
  • Pivot tabling with key-value paired data points
  • Summary
  • Chapter 5: Powerful Exploratory Data Analysis with MLlib
  • Computing summary statistics with MLlib
  • Using Pearson and Spearman correlations to discover correlations
  • The Pearson correlation
  • The Spearman correlation
  • Computing Pearson and Spearman correlations
  • Testing our hypotheses on large datasets
  • Summary
  • Chapter 6: Putting Structure on Your Big Data with SparkSQL
  • Manipulating DataFrames with Spark SQL schemas
  • Using Spark DSL to build queries
  • Summary
  • Chapter 7: Transformations and Actions
  • Using Spark transformations to defer computations to a later time
  • Avoiding transformations
  • Using the reduce and reduceByKey methods to calculate the results
  • Performing actions that trigger computations
  • Reusing the same RDD for different actions
  • Summary
  • Chapter 8: Immutable Design
  • Delving into the Spark RDD's parent/child chain
  • Extending an RDD
  • Chaining a new RDD with the parent
  • Testing our custom RDD
  • Using RDD in an immutable way
  • Using DataFrame operations to transform
  • Immutability in the highly concurrent environment
  • Using the Dataset API in an immutable way
  • Summary
  • Chapter 9: Avoiding Shuffle and Reducing Operational Expenses
  • Detecting a shuffle in a process
  • Testing operations that cause a shuffle in Apache Spark
  • Changing the design of jobs with wide dependencies
  • Using keyBy() operations to reduce shuffle
  • Using a custom partitioner to reduce shuffle
  • Summary
  • Chapter 10: Saving Data in the Correct Format
  • Saving data in plain text format
  • Leveraging JSON as a data format
  • Tabular formats - CSV
  • Using Avro with Spark
  • Columnar formats - Parquet
  • Summary
  • Chapter 11: Working with the Spark Key/Value API
  • Available actions on key/value pairs
  • Using aggregateByKey instead of groupBy()
  • Actions on key/value pairs
  • Available partitioners on key/value data
  • Implementing a custom partitioner
  • Summary
  • Chapter 12: Testing Apache Spark Jobs
  • Separating logic from Spark engine - unit testing
  • Integration testing using SparkSession
  • Mocking data sources using partial functions
  • Using ScalaCheck for property-based testing
  • Testing in different versions of Spark
  • Summary
  • Chapter 13: Leveraging the Spark GraphX API
  • Creating a graph from a data source
  • Creating the loader component
  • Revisiting the graph format
  • Loading Spark from file
  • Using the Vertex API
  • Constructing a graph using the vertex
  • Creating couple relationships
  • Using the Edge API
  • Constructing the graph using edge
  • Calculating the degree of the vertex
  • The in-degree
  • The out-degree
  • Calculating PageRank
  • Loading and reloading data about users and followers
  • Summary
  • Other Books You May Enjoy
  • Index
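
To give a flavor of the techniques listed in the contents above (parallelization in Chapter 2, map/reduce averaging in Chapter 4, and reduceByKey in Chapters 7 and 11), here is a minimal PySpark sketch; the sample values, application name, and local[*] master are invented for illustration and are not taken from the book.

    from pyspark.sql import SparkSession

    # Hypothetical local session; the app name and master are illustrative only.
    spark = SparkSession.builder.master("local[*]").appName("catalog-sketch").getOrCreate()
    sc = spark.sparkContext

    # Chapter 2 style: parallelize a small in-memory collection into an RDD.
    prices = sc.parallelize([("fruit", 3.0), ("fruit", 5.0), ("veg", 2.0), ("veg", 4.0)])

    # Chapter 4 style: overall average computed with map and reduce.
    total, count = prices.map(lambda kv: (kv[1], 1)).reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))
    print("overall average:", total / count)

    # Chapter 7/11 style: per-key sums with reduceByKey, which combines values
    # map-side before the shuffle, unlike groupByKey.
    print("per-key totals:", prices.reduceByKey(lambda a, b: a + b).collect())

    spark.stop()

The later sections listed above (aggregateByKey, custom partitioners, saving to Parquet) would slot into the same session-based pattern.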