Apache Spark 2.x for Java developers: explore data at scale using the Java APIs of Apache Spark 2.x

Unleash the data processing and analytics capability of Apache Spark with the language of choice: Java. About This Book: Perform big data processing with Spark, without having to learn Scala! Use the Spark Java API to implement efficient enterprise-grade applications for data processing and analytics G...

Bibliographic Details
Other Authors:	Gulati, Sourav, author; Kumar, Sumit, author
Format:	E-book
Language:	English
Published:	Birmingham, [England] ; Mumbai, [India] : Packt, 2017.
Edition:	1st edition
Subjects:	Spark (Electronic resource : Apache Software Foundation) ; Application program interfaces (Computer software) ; Java (Computer program language)
View in the Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630675006719

Table of Contents:

Cover
Copyright
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Introduction to Spark
Dimensions of big data
What makes Hadoop so revolutionary?
Defining HDFS
NameNode
HDFS I/O
YARN
Processing the flow of application submission in YARN
Overview of MapReduce
Why Apache Spark?
RDD - the first citizen of Spark
Operations on RDD
Lazy evaluation
Benefits of RDD
Exploring the Spark ecosystem
What's new in Spark 2.X?
References
Summary
Chapter 2: Revisiting Java
Why use Java for Spark?
Generics
Creating your own generic type
Interfaces
Static method in an interface
Default method in interface
What if a class implements two interfaces which have default methods with same name and signature?
Anonymous inner classes
Lambda expressions
Functional interface
Syntax of Lambda expressions
Lexical scoping
Method reference
Understanding closures
Streams
Generating streams
Intermediate operations
Working with intermediate operations
Terminal operations
Working with terminal operations
String collectors
Collection collectors
Map collectors
Groupings
Partitioning
Matching
Finding elements
Summary
Chapter 3: Let Us Spark
Getting started with Spark
Spark REPL also known as CLI
Some basic exercises using Spark shell
Checking Spark version
Creating and filtering RDD
Word count on RDD
Finding the sum of all even numbers in an RDD of integers
Counting the number of words in a file
Spark components
Spark Driver Web UI
Jobs
Stages
Storage
Environment
Executors
SQL
Streaming
Spark job configuration and submission
Spark REST APIs
Summary
Chapter 4: Understanding the Spark Programming Model
Hello Spark
Prerequisites
Common RDD transformations
Map
Filter
flatMap
mapToPair
flatMapToPair
union
Intersection
Distinct
Cartesian
groupByKey
reduceByKey
sortByKey
Join
CoGroup
Common RDD actions
isEmpty
collect
collectAsMap
count
countByKey
countByValue
Max
Min
First
Take
takeOrdered
takeSample
top
reduce
Fold
aggregate
forEach
saveAsTextFile
saveAsObjectFile
RDD persistence and cache
Summary
Chapter 5: Working with Data and Storage
Interaction with external storage systems
Interaction with local filesystem
Interaction with Amazon S3
Interaction with HDFS
Interaction with Cassandra
Working with different data formats
Plain and specially formatted text
Working with CSV data
Working with JSON data
Working with XML Data
References
Summary
Chapter 6: Spark on Cluster
Spark application in distributed-mode
Driver program
Executor program
Cluster managers
Spark standalone
Installation of Spark standalone cluster
Start master
Start slave
Stop master and slaves
Deploying applications on Spark standalone cluster
Client mode
Cluster mode
Useful job configurations
Useful cluster level configurations (Spark standalone)
Yet Another Resource Negotiator (YARN)
YARN client
YARN cluster
Useful job configuration
Summary
Chapter 7: Spark Programming Model - Advanced
RDD partitioning
Repartitioning
How Spark calculates the partition count for transformations with shuffling (wide transformations)
Partitioner
Hash Partitioner
Range Partitioner
Custom Partitioner
Advanced transformations
mapPartitions
mapPartitionsWithIndex
mapPartitionsToPair
mapValues
flatMapValues
repartitionAndSortWithinPartitions
coalesce
foldByKey
aggregateByKey
combineByKey
Advanced actions
Approximate actions
Asynchronous actions
Miscellaneous actions
Shared variable
Broadcast variable
Properties of the broadcast variable
Lifecycle of a broadcast variable
Map-side join using broadcast variable
Accumulators
Driver program
Summary
Chapter 8: Working with Spark SQL
SQLContext and HiveContext
Initializing SparkSession
Reading CSV using SparkSession
Dataframe and dataset
SchemaRDD
Dataframe
Dataset
Creating a dataset using encoders
Creating a dataset using StructType
Unified dataframe and dataset API
Data persistence
Spark SQL operations
Untyped dataset operation
Temporary view
Global temporary view
Spark UDF
Spark UDAF
Untyped UDAF
Type-safe UDAF
Hive integration
Table Persistence
Summary
Chapter 9: Near Real-Time Processing with Spark Streaming
Introducing Spark Streaming
Understanding micro batching
Getting started with Spark Streaming jobs
Streaming sources
fileStream
Kafka
Streaming transformations
Stateless transformation
Stateful transformation
Checkpointing
Windowing
Transform operation
Fault tolerance and reliability
Data receiver stage
File streams
Advanced streaming sources
Transformation stage
Output stage
Structured Streaming
Recap of the use case
Structured streaming - programming model
Built-in input sources and sinks
Input sources
Built-in Sinks
Summary
Chapter 10: Machine Learning Analytics with Spark MLlib
Introduction to machine learning
Concepts of machine learning
Datatypes
Machine learning work flow
Pipelines
Operations on feature vectors
Feature extractors
Feature transformers
Feature selectors
Summary
Chapter 11: Learning Spark GraphX
Introduction to GraphX
Introduction to Property Graph
Getting started with the GraphX API
Using vertex and edge RDDs
From edges
EdgeTriplet
Graph operations
mapVertices
mapEdges
mapTriplets
reverse
subgraph
aggregateMessages
outerJoinVertices
Graph algorithms
PageRank
Static PageRank
Dynamic PageRank
Triangle counting
Connected components
Summary
Index