Big data analytics: a handy reference guide for data analysts and data scientists to help obtain value from big data analytics using Spark on Hadoop clusters

A handy reference guide for data analysts and data scientists to help obtain value from big data analytics using Spark on Hadoop clusters. About This Book: This book is based on the latest 2.0 version of Apache Spark and the 2.7 version of Hadoop, integrated with the most commonly used tools. Learn all Spar...


Bibliographic Details
Other Authors: Ankam, Venkat (author)
Format: E-book
Language: English
Published: Birmingham, England: Packt Publishing, 2016.
Edition: 1st edition
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630237506719
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • About the Author
  • Acknowledgement
  • About the Reviewers
  • www.PacktPub.com
  • Preface
  • Chapter 1: Big Data Analytics at a 10,000-Foot View
  • Big Data analytics and the role of Hadoop and Spark
  • A typical Big Data analytics project life cycle
  • Identifying the problem and outcomes
  • Identifying the necessary data
  • Data collection
  • Preprocessing data and ETL
  • Performing analytics
  • Visualizing data
  • The role of Hadoop and Spark
  • Big Data science and the role of Hadoop and Spark
  • A fundamental shift from data analytics to data science
  • Data scientists versus software engineers
  • Data scientists versus data analysts
  • Data scientists versus business analysts
  • A typical data science project life cycle
  • Hypothesis and modeling
  • Measuring the effectiveness
  • Making improvements
  • Communicating the results
  • The role of Hadoop and Spark
  • Tools and techniques
  • Real-life use cases
  • Summary
  • Chapter 2: Getting Started with Apache Hadoop and Apache Spark
  • Introducing Apache Hadoop
  • Hadoop Distributed File System
  • Features of HDFS
  • MapReduce
  • MapReduce features
  • MapReduce v1 versus MapReduce v2
  • MapReduce v1 challenges
  • YARN
  • Storage options on Hadoop
  • File formats
  • Compression formats
  • Introducing Apache Spark
  • Spark history
  • What is Apache Spark?
  • What Apache Spark is not
  • MapReduce issues
  • Spark's stack
  • Why Hadoop plus Spark?
  • Hadoop features
  • Spark features
  • Frequently asked questions about Spark
  • Installing Hadoop plus Spark clusters
  • Summary
  • Chapter 3: Deep Dive into Apache Spark
  • Starting Spark daemons
  • Working with CDH
  • Working with HDP, MapR, and Spark pre-built packages
  • Learning Spark core concepts
  • Ways to work with Spark
  • Spark Shell
  • Spark applications
  • Resilient Distributed Dataset
  • Method 1 - parallelizing a collection
  • Method 2 - reading from a file
  • Spark context
  • Transformations and actions
  • Parallelism in RDDs
  • Lazy evaluation
  • Lineage Graph
  • Serialization
  • Leveraging Hadoop file formats in Spark
  • Data locality
  • Shared variables
  • Pair RDDs
  • Lifecycle of Spark program
  • Pipelining
  • Spark execution summary
  • Spark applications
  • Spark Shell versus Spark applications
  • Creating a Spark context
  • SparkConf
  • SparkSubmit
  • Spark Conf precedence order
  • Important application configurations
  • Persistence and caching
  • Storage levels
  • What level to choose?
  • Spark resource managers - Standalone, YARN, and Mesos
  • Local versus cluster mode
  • Cluster resource managers
  • Standalone
  • YARN
  • Mesos
  • Which resource manager to use?
  • Summary
  • Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
  • History of Spark SQL
  • Architecture of Spark SQL
  • Introducing SQL, Datasources, DataFrame, and Dataset APIs
  • Evolution of DataFrames and Datasets
  • What's wrong with RDDs?
  • RDD Transformations versus Dataset and DataFrames Transformations
  • Why Datasets and DataFrames?
  • Optimization
  • Speed
  • Automatic Schema Discovery
  • Multiple sources, multiple languages
  • Interoperability between RDDs and others
  • Select and read necessary data only
  • When to use RDDs, Datasets, and DataFrames?
  • Analytics with DataFrames
  • Creating SparkSession
  • Creating DataFrames
  • Creating DataFrames from structured data files
  • Creating DataFrames from RDDs
  • Creating DataFrames from tables in Hive
  • Creating DataFrames from external databases
  • Converting DataFrames to RDDs
  • Common Dataset/DataFrame operations
  • Input and Output Operations
  • Basic Dataset/DataFrame functions
  • DSL functions
  • Built-in functions, aggregate functions, and window functions
  • Actions
  • RDD operations
  • Caching data
  • Performance optimizations
  • Analytics with the Dataset API
  • Creating Datasets
  • Converting a DataFrame to a Dataset
  • Converting a Dataset to a DataFrame
  • Accessing metadata using Catalog
  • Data Sources API
  • Read and write functions
  • Built-in sources
  • Working with text files
  • Working with JSON
  • Working with Parquet
  • Working with ORC
  • Working with JDBC
  • Working with CSV
  • External sources
  • Working with AVRO
  • Working with XML
  • Working with Pandas
  • DataFrame based Spark-on-HBase connector
  • Spark SQL as a distributed SQL engine
  • Spark SQL's Thrift server for JDBC/ODBC access
  • Querying data using beeline client
  • Querying data from Hive using spark-sql CLI
  • Integration with BI tools
  • Hive on Spark
  • Summary
  • Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
  • Introducing real-time processing
  • Pros and cons of Spark Streaming
  • History of Spark Streaming
  • Architecture of Spark Streaming
  • Spark Streaming application flow
  • Stateless and stateful stream processing
  • Spark Streaming transformations and actions
  • Union
  • Join
  • Transform operation
  • updateStateByKey
  • mapWithState
  • Window operations
  • Output operations
  • Input sources and output stores
  • Basic sources
  • Advanced sources
  • Custom sources
  • Receiver reliability
  • Output stores
  • Spark Streaming with Kafka and HBase
  • Receiver-based approach
  • Role of Zookeeper
  • Direct approach (no receivers)
  • Integration with HBase
  • Advanced concepts of Spark Streaming
  • Using DataFrames
  • MLlib operations
  • Caching/persistence
  • Fault-tolerance in Spark Streaming
  • Failure of executor
  • Failure of driver
  • Performance tuning of Spark Streaming applications
  • Monitoring applications
  • Introducing Structured Streaming
  • Structured Streaming application flow
  • When to use Structured Streaming?
  • Streaming Datasets and Streaming DataFrames
  • Input sources and output sinks
  • Operations on Streaming Datasets and Streaming DataFrames
  • Summary
  • Chapter 6: Notebooks and Dataflows with Spark and Hadoop
  • Introducing web-based notebooks
  • Introducing Jupyter
  • Installing Jupyter
  • Analytics with Jupyter
  • Introducing Apache Zeppelin
  • Jupyter versus Zeppelin
  • Installing Apache Zeppelin
  • Ambari service
  • The manual method
  • Analytics with Zeppelin
  • The Livy REST job server and Hue Notebooks
  • Installing and configuring the Livy server and Hue
  • Using the Livy server
  • An interactive session
  • A batch session
  • Sharing SparkContexts and RDDs
  • Using Livy with Hue Notebook
  • Using Livy with Zeppelin
  • Introducing Apache NiFi for dataflows
  • Installing Apache NiFi
  • Dataflows and analytics with NiFi
  • Summary
  • Chapter 7: Machine Learning with Spark and Hadoop
  • Introducing machine learning
  • Machine learning on Spark and Hadoop
  • Machine learning algorithms
  • Supervised learning
  • Unsupervised learning
  • Recommender systems
  • Feature extraction and transformation
  • Optimization
  • Spark MLlib data types
  • An example of machine learning algorithms
  • Logistic regression for spam detection
  • Building machine learning pipelines
  • An example of a pipeline workflow
  • Building an ML pipeline
  • Saving and loading models
  • Machine learning with H2O and Spark
  • Why Sparkling Water?
  • An application flow on YARN
  • Getting started with Sparkling Water
  • Introducing Hivemall
  • Introducing Hivemall for Spark
  • Summary
  • Chapter 8: Building Recommendation Systems with Spark and Mahout
  • Building recommendation systems
  • Content-based filtering
  • Collaborative filtering
  • User-based collaborative filtering
  • Item-based collaborative filtering
  • Limitations of a recommendation system
  • A recommendation system with MLlib
  • Preparing the environment
  • Creating RDDs
  • Exploring the data with DataFrames
  • Creating training and testing datasets
  • Creating a model
  • Making predictions
  • Evaluating the model with testing data
  • Checking the accuracy of the model
  • Explicit versus implicit feedback
  • The Mahout and Spark integration
  • Installing Mahout
  • Exploring the Mahout shell
  • Building a universal recommendation system with Mahout and search tool
  • Summary
  • Chapter 9: Graph Analytics with GraphX
  • Introducing graph processing
  • What is a graph?
  • Graph databases versus graph processing systems
  • Introducing GraphX
  • Graph algorithms
  • Getting started with GraphX
  • Basic operations of GraphX
  • Creating a graph
  • Counting
  • Filtering
  • inDegrees, outDegrees, and degrees
  • Triplets
  • Transforming graphs
  • Transforming attributes
  • Modifying graphs
  • Joining graphs
  • VertexRDD and EdgeRDD operations
  • GraphX algorithms
  • Triangle counting
  • Connected components
  • Analyzing flight data using GraphX
  • Pregel API
  • Introducing GraphFrames
  • Motif finding
  • Loading and saving GraphFrames
  • Summary
  • Chapter 10: Interactive Analytics with SparkR
  • Introducing R and SparkR
  • What is R?
  • Introducing SparkR
  • Architecture of SparkR
  • Getting started with SparkR
  • Installing and configuring R
  • Using SparkR shell
  • Local mode
  • Standalone mode
  • YARN mode
  • Creating a local DataFrame
  • Creating a DataFrame from a DataSources API
  • Creating a DataFrame from Hive
  • Using SparkR scripts
  • Using DataFrames with SparkR
  • Using SparkR with RStudio
  • Machine learning with SparkR
  • Using the Naive Bayes model
  • Using the k-means model
  • Using SparkR with Zeppelin
  • Summary
  • Index