Learning Spark SQL architect streaming analytics and machine learning solutions

Design, implement, and deliver successful streaming applications, machine learning pipelines and graph applications using Spark SQL API About This Book Learn about the design and implementation of streaming applications, machine learning pipelines, deep learning, and large-scale graph processing app...

Descripción completa

Detalles Bibliográficos
Otros Autores:	Sarkar, Aurobindo, author (author)
Formato:	Libro electrónico
Idioma:	Inglés
Publicado:	Birmingham, [England] ; Mumbai, [India] : Packt Publishing 2017.
Edición:	1st edition
Materias:	Spark (Electronic resource : Apache Software Foundation) Data mining. Big data.
Ver en Biblioteca Universitat Ramon Llull:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630503006719

Tabla de Contenidos:

Cover
Title Page
Copyright
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Getting Started with Spark SQL
What is Spark SQL?
Introducing SparkSession
Understanding Spark SQL concepts
Understanding Resilient Distributed Datasets (RDDs)
Understanding DataFrames and Datasets
Understanding the Catalyst optimizer
Understanding Catalyst optimizations
Understanding Catalyst transformations
Introducing Project Tungsten
Using Spark SQL in streaming applications
Understanding Structured Streaming internals
Summary
Chapter 2: Using Spark SQL for Processing Structured and Semistructured Data
Understanding data sources in Spark applications
Selecting Spark data sources
Using Spark with relational databases
Using Spark with MongoDB (NoSQL database)
Using Spark with JSON data
Using Spark with Avro files
Using Spark with Parquet files
Defining and using custom data sources in Spark
Summary
Chapter 3: Using Spark SQL for Data Exploration
Introducing Exploratory Data Analysis (EDA)
Using Spark SQL for basic data analysis
Identifying missing data
Computing basic statistics
Identifying data outliers
Visualizing data with Apache Zeppelin
Sampling data with Spark SQL APIs
Sampling with the DataFrame/Dataset API
Sampling with the RDD API
Using Spark SQL for creating pivot tables
Summary
Chapter 4: Using Spark SQL for Data Munging
Introducing data munging
Exploring data munging techniques
Pre-processing of the&amp
#160
household electric consumption Dataset
Computing basic statistics and aggregations
Augmenting the Dataset
Executing other miscellaneous processing steps
Pre-processing of&amp
#160
the weather Dataset.
Analyzing missing data
Combining data using a JOIN operation
Munging textual data
Processing multiple input data files
Removing stop words
Munging time series data
Pre-processing of the&amp
#160
time-series Dataset
Processing date fields
Persisting and loading data
Defining a date-time index
Using the&amp
#160
&amp
#160
TimeSeriesRDD&amp
#160
object
Handling missing time-series data
Computing basic statistics
Dealing with variable length records
Converting variable-length records to fixed-length records
Extracting data from "messy" columns
Preparing data for machine learning
Pre-processing data for machine learning
Creating and running a machine learning pipeline
Summary
Chapter 5: Using Spark SQL in Streaming Applications
Introducing streaming data applications
Building Spark streaming applications
Implementing sliding window-based functionality
Joining a streaming Dataset with a static Dataset
Using the Dataset API in Structured Streaming
Using output sinks
Using the Foreach Sink for arbitrary computations on output
Using the Memory Sink to save output to a table
Using the File Sink to save output to a partitioned table
Monitoring streaming queries
Using Kafka with Spark Structured Streaming
Introducing Kafka concepts
Introducing ZooKeeper concepts
Introducing Kafka-Spark integration
Introducing Kafka-Spark Structured Streaming
Writing a receiver for a custom data source
Summary
Chapter 6: Using Spark SQL in Machine Learning Applications
Introducing machine learning applications
Understanding Spark ML pipelines and their components
Understanding the steps in a pipeline application development process
Introducing feature engineering
Creating new features from raw data.
Estimating the importance of a feature
Understanding dimensionality reduction
Deriving good features
Implementing a Spark ML classification model
Exploring the diabetes Dataset
Pre-processing the data
Building the Spark ML pipeline
Using StringIndexer for indexing categorical features and labels
Using VectorAssembler for assembling features into one column
Using a Spark ML classifier
Creating a Spark ML pipeline
Creating the training and test Datasets
Making predictions using the PipelineModel
Selecting the best model
Changing the ML algorithm in the pipeline
Introducing Spark ML tools and utilities
Using Principal Component Analysis to select features
Using encoders
Using Bucketizer
Using VectorSlicer
Using Chi-squared selector
Using a Normalizer
Retrieving our original labels
Implementing a Spark ML clustering model
Summary
Chapter 7: Using Spark SQL in Graph Applications
Introducing large-scale graph applications
Exploring graphs using GraphFrames
Constructing a GraphFrame
Basic graph queries and operations
Motif analysis using GraphFrames
Processing subgraphs
Applying graph algorithms
Saving and loading GraphFrames
Analyzing JSON input modeled as a graph&amp
#160
Processing graphs containing multiple types of relationships
Understanding GraphFrame internals
Viewing GraphFrame physical execution plan
Understanding partitioning in GraphFrames
Summary
Chapter 8: Using Spark SQL with SparkR
Introducing SparkR
Understanding the SparkR architecture
Understanding SparkR DataFrames
Using SparkR for EDA and data munging tasks
Reading and writing Spark DataFrames
Exploring structure and contents of Spark DataFrames
Running basic operations on Spark DataFrames
Executing SQL statements on Spark DataFrames.
Merging SparkR DataFrames
Using User Defined Functions (UDFs)
Using SparkR for computing summary statistics
Using SparkR for data visualization
Visualizing data on a map
Visualizing graph nodes and edges
Using SparkR for machine learning
Summary
Chapter 9: Developing Applications with Spark SQL
Introducing Spark SQL applications
Understanding text analysis applications
Using Spark SQL for textual analysis
Preprocessing textual data
Computing readability
Using word lists
Creating data preprocessing pipelines
Understanding themes in document corpuses
Using Naive Bayes classifiers
Developing a machine learning application
Summary
Chapter 10: Using Spark SQL in Deep Learning Applications
Introducing neural networks
Understanding deep learning
Understanding representation learning
Understanding stochastic gradient descent
Introducing deep learning in Spark
Introducing CaffeOnSpark
Introducing DL4J
Introducing TensorFrames
Working with BigDL
Tuning hyperparameters of deep learning models
Introducing deep learning pipelines
Understanding Supervised learning
Understanding convolutional neural networks
Using neural networks for text classification
Using deep neural networks for language processing
Understanding Recurrent Neural Networks
Introducing autoencoders
Summary
Chapter 11: Tuning Spark SQL Components for Performance
Introducing performance tuning in Spark SQL
Understanding DataFrame/Dataset APIs
Optimizing data serialization
Understanding Catalyst optimizations
Understanding the Dataset/DataFrame API
Understanding Catalyst transformations
Visualizing Spark application execution
Exploring Spark application execution metrics
Using external tools for performance tuning
Cost-based optimizer in Apache Spark 2.2.
Understanding the&amp
#160
CBO statistics collection
Statistics collection functions
Filter operator
Join operator
Build side selection
Understanding multi-way JOIN ordering optimization
Understanding performance improvements using whole-stage code generation
Summary
Chapter 12: Spark SQL in Large-Scale Application Architectures
Understanding Spark-based application architectures
Using Apache Spark for batch processing
Using Apache Spark for stream processing
Understanding the Lambda architecture
Understanding the Kappa Architecture
Design considerations for building scalable stream processing applications
Building robust ETL pipelines using Spark SQL
Choosing appropriate data formats
Transforming data in ETL pipelines
Addressing errors in ETL pipelines
Implementing a scalable monitoring solution
Deploying Spark machine learning pipelines
Understanding the challenges in typical ML deployment environments
Understanding types of model scoring architectures
Using cluster managers
Summary
Index.

Learning Spark SQL architect streaming analytics and machine learning solutions

Ejemplares similares