Big data analytics: a handy reference guide for data analysts and data scientists to help obtain value from big data analytics using Spark on Hadoop clusters

A handy reference guide for data analysts and data scientists to help obtain value from big data analytics using Spark on Hadoop clusters. About This Book: This book is based on version 2.0 of Apache Spark and version 2.7 of Hadoop, integrated with the most commonly used tools. Learn all Spar...

Bibliographic Details
Other Authors:	Ankam, Venkat (author)
Format:	Electronic book
Language:	English
Published:	Birmingham, England: Packt Publishing, 2016.
Edition:	1st edition
Subjects:	Big data > Security measures.
View at Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630237506719

Table of Contents:

Cover
Copyright
Credits
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Preface
Chapter 1: Big Data Analytics at a 10,000-Foot View
Big Data analytics and the role of Hadoop and Spark
A typical Big Data analytics project life cycle
Identifying the problem and outcomes
Identifying the necessary data
Data collection
Preprocessing data and ETL
Performing analytics
Visualizing data
The role of Hadoop and Spark
Big Data science and the role of Hadoop and Spark
A fundamental shift from data analytics to data science
Data scientists versus software engineers
Data scientists versus data analysts
Data scientists versus business analysts
A typical data science project life cycle
Hypothesis and modeling
Measuring the effectiveness
Making improvements
Communicating the results
The role of Hadoop and Spark
Tools and techniques
Real-life use cases
Summary
Chapter 2: Getting Started with Apache Hadoop and Apache Spark
Introducing Apache Hadoop
Hadoop Distributed File System
Features of HDFS
MapReduce
MapReduce features
MapReduce v1 versus MapReduce v2
MapReduce v1 challenges
YARN
Storage options on Hadoop
File formats
Compression formats
Introducing Apache Spark
Spark history
What is Apache Spark?
What Apache Spark is not
MapReduce issues
Spark's stack
Why Hadoop plus Spark?
Hadoop features
Spark features
Frequently asked questions about Spark
Installing Hadoop plus Spark clusters
Summary
Chapter 3: Deep Dive into Apache Spark
Starting Spark daemons
Working with CDH
Working with HDP, MapR, and Spark pre-built packages
Learning Spark core concepts
Ways to work with Spark
Spark Shell
Spark applications
Resilient Distributed Dataset
Method 1 - parallelizing a collection
Method 2 - reading from a file
Spark context
Transformations and actions
Parallelism in RDDs
Lazy evaluation
Lineage Graph
Serialization
Leveraging Hadoop file formats in Spark
Data locality
Shared variables
Pair RDDs
Lifecycle of Spark program
Pipelining
Spark execution summary
Spark applications
Spark Shell versus Spark applications
Creating a Spark context
SparkConf
SparkSubmit
Spark Conf precedence order
Important application configurations
Persistence and caching
Storage levels
What level to choose?
Spark resource managers - Standalone, YARN, and Mesos
Local versus cluster mode
Cluster resource managers
Standalone
YARN
Mesos
Which resource manager to use?
Summary
Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
History of Spark SQL
Architecture of Spark SQL
Introducing SQL, Datasources, DataFrame, and Dataset APIs
Evolution of DataFrames and Datasets
What's wrong with RDDs?
RDD Transformations versus Dataset and DataFrames Transformations
Why Datasets and DataFrames?
Optimization
Speed
Automatic Schema Discovery
Multiple sources, multiple languages
Interoperability between RDDs and others
Select and read necessary data only
When to use RDDs, Datasets, and DataFrames?
Analytics with DataFrames
Creating SparkSession
Creating DataFrames
Creating DataFrames from structured data files
Creating DataFrames from RDDs
Creating DataFrames from tables in Hive
Creating DataFrames from external databases
Converting DataFrames to RDDs
Common Dataset/DataFrame operations
Input and Output Operations
Basic Dataset/DataFrame functions
DSL functions
Built-in functions, aggregate functions, and window functions
Actions
RDD operations
Caching data
Performance optimizations
Analytics with the Dataset API
Creating Datasets
Converting a DataFrame to a Dataset
Converting a Dataset to a DataFrame
Accessing metadata using Catalog
Data Sources API
Read and write functions
Built-in sources
Working with text files
Working with JSON
Working with Parquet
Working with ORC
Working with JDBC
Working with CSV
External sources
Working with AVRO
Working with XML
Working with Pandas
DataFrame based Spark-on-HBase connector
Spark SQL as a distributed SQL engine
Spark SQL's Thrift server for JDBC/ODBC access
Querying data using beeline client
Querying data from Hive using spark-sql CLI
Integration with BI tools
Hive on Spark
Summary
Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
Introducing real-time processing
Pros and cons of Spark Streaming
History of Spark Streaming
Architecture of Spark Streaming
Spark Streaming application flow
Stateless and stateful stream processing
Spark Streaming transformations and actions
Union
Join
Transform operation
updateStateByKey
mapWithState
Window operations
Output operations
Input sources and output stores
Basic sources
Advanced sources
Custom sources
Receiver reliability
Output stores
Spark Streaming with Kafka and HBase
Receiver-based approach
Role of Zookeeper
Direct approach (no receivers)
Integration with HBase
Advanced concepts of Spark Streaming
Using DataFrames
MLlib operations
Caching/persistence
Fault-tolerance in Spark Streaming
Failure of executor
Failure of driver
Performance tuning of Spark Streaming applications
Monitoring applications
Introducing Structured Streaming
Structured Streaming application flow
When to use Structured Streaming?
Streaming Datasets and Streaming DataFrames
Input sources and output sinks
Operations on Streaming Datasets and Streaming DataFrames
Summary
Chapter 6: Notebooks and Dataflows with Spark and Hadoop
Introducing web-based notebooks
Introducing Jupyter
Installing Jupyter
Analytics with Jupyter
Introducing Apache Zeppelin
Jupyter versus Zeppelin
Installing Apache Zeppelin
Ambari service
The manual method
Analytics with Zeppelin
The Livy REST job server and Hue Notebooks
Installing and configuring the Livy server and Hue
Using the Livy server
An interactive session
A batch session
Sharing SparkContexts and RDDs
Using Livy with Hue Notebook
Using Livy with Zeppelin
Introducing Apache NiFi for dataflows
Installing Apache NiFi
Dataflows and analytics with NiFi
Summary
Chapter 7: Machine Learning with Spark and Hadoop
Introducing machine learning
Machine learning on Spark and Hadoop
Machine learning algorithms
Supervised learning
Unsupervised learning
Recommender systems
Feature extraction and transformation
Optimization
Spark MLlib data types
An example of machine learning algorithms
Logistic regression for spam detection
Building machine learning pipelines
An example of a pipeline workflow
Building an ML pipeline
Saving and loading models
Machine learning with H2O and Spark
Why Sparkling Water?
An application flow on YARN
Getting started with Sparkling Water
Introducing Hivemall
Introducing Hivemall for Spark
Summary
Chapter 8: Building Recommendation Systems with Spark and Mahout
Building recommendation systems
Content-based filtering
Collaborative filtering
User-based collaborative filtering
Item-based collaborative filtering
Limitations of a recommendation system
A recommendation system with MLlib
Preparing the environment
Creating RDDs
Exploring the data with DataFrames
Creating training and testing datasets
Creating a model
Making predictions
Evaluating the model with testing data
Checking the accuracy of the model
Explicit versus implicit feedback
The Mahout and Spark integration
Installing Mahout
Exploring the Mahout shell
Building a universal recommendation system with Mahout and search tool
Summary
Chapter 9: Graph Analytics with GraphX
Introducing graph processing
What is a graph?
Graph databases versus graph processing systems
Introducing GraphX
Graph algorithms
Getting started with GraphX
Basic operations of GraphX
Creating a graph
Counting
Filtering
inDegrees, outDegrees, and degrees
Triplets
Transforming graphs
Transforming attributes
Modifying graphs
Joining graphs
VertexRDD and EdgeRDD operations
GraphX algorithms
Triangle counting
Connected components
Analyzing flight data using GraphX
Pregel API
Introducing GraphFrames
Motif finding
Loading and saving GraphFrames
Summary
Chapter 10: Interactive Analytics with SparkR
Introducing R and SparkR
What is R?
Introducing SparkR
Architecture of SparkR
Getting started with SparkR
Installing and configuring R
Using SparkR shell
Local mode
Standalone mode
YARN mode
Creating a local DataFrame
Creating a DataFrame from a DataSources API
Creating a DataFrame from Hive
Using SparkR scripts
Using DataFrames with SparkR
Using SparkR with RStudio
Machine learning with SparkR
Using the Naive Bayes model
Using the k-means model
Using SparkR with Zeppelin
Summary
Index
