Apache Spark 2.x for Java Developers: Explore data at scale using the Java APIs of Apache Spark 2.x
Unleash the data processing and analytics capability of Apache Spark with the language of choice: Java. About This Book: Perform big data processing with Spark without having to learn Scala! Use the Spark Java API to implement efficient enterprise-grade applications for data processing and analytics G...
Other Authors: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Birmingham, [England] ; Mumbai, [India] : Packt, 2017. |
Edition: | 1st edition |
Subjects: | |
View in Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630675006719 |
Table of Contents:
- Cover
- Copyright
- Credits
- Foreword
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Introduction to Spark
- Dimensions of big data
- What makes Hadoop so revolutionary?
- Defining HDFS
- NameNode
- HDFS I/O
- YARN
- Processing the flow of application submission in YARN
- Overview of MapReduce
- Why Apache Spark?
- RDD - the first citizen of Spark
- Operations on RDD
- Lazy evaluation
- Benefits of RDD
- Exploring the Spark ecosystem
- What's new in Spark 2.X?
- References
- Summary
- Chapter 2: Revisiting Java
- Why use Java for Spark?
- Generics
- Creating your own generic type
- Interfaces
- Static method in an interface
- Default method in interface
- What if a class implements two interfaces which have default methods with same name and signature?
- Anonymous inner classes
- Lambda expressions
- Functional interface
- Syntax of Lambda expressions
- Lexical scoping
- Method reference
- Understanding closures
- Streams
- Generating streams
- Intermediate operations
- Working with intermediate operations
- Terminal operations
- Working with terminal operations
- String collectors
- Collection collectors
- Map collectors
- Groupings
- Partitioning
- Matching
- Finding elements
- Summary
- Chapter 3: Let Us Spark
- Getting started with Spark
- Spark REPL also known as CLI
- Some basic exercises using Spark shell
- Checking Spark version
- Creating and filtering RDD
- Word count on RDD
- Finding the sum of all even numbers in an RDD of integers
- Counting the number of words in a file
- Spark components
- Spark Driver Web UI
- Jobs
- Stages
- Storage
- Environment
- Executors
- SQL
- Streaming
- Spark job configuration and submission
- Spark REST APIs
- Summary
- Chapter 4: Understanding the Spark Programming Model
- Hello Spark
- Prerequisites
- Common RDD transformations
- Map
- Filter
- flatMap
- mapToPair
- flatMapToPair
- union
- Intersection
- Distinct
- Cartesian
- groupByKey
- reduceByKey
- sortByKey
- Join
- CoGroup
- Common RDD actions
- isEmpty
- collect
- collectAsMap
- count
- countByKey
- countByValue
- Max
- Min
- First
- Take
- takeOrdered
- takeSample
- top
- reduce
- Fold
- aggregate
- forEach
- saveAsTextFile
- saveAsObjectFile
- RDD persistence and cache
- Summary
- Chapter 5: Working with Data and Storage
- Interaction with external storage systems
- Interaction with local filesystem
- Interaction with Amazon S3
- Interaction with HDFS
- Interaction with Cassandra
- Working with different data formats
- Plain and specially formatted text
- Working with CSV data
- Working with JSON data
- Working with XML Data
- References
- Summary
- Chapter 6: Spark on Cluster
- Spark application in distributed-mode
- Driver program
- Executor program
- Cluster managers
- Spark standalone
- Installation of Spark standalone cluster
- Start master
- Start slave
- Stop master and slaves
- Deploying applications on Spark standalone cluster
- Client mode
- Cluster mode
- Useful job configurations
- Useful cluster level configurations (Spark standalone)
- Yet Another Resource Negotiator (YARN)
- YARN client
- YARN cluster
- Useful job configuration
- Summary
- Chapter 7: Spark Programming Model - Advanced
- RDD partitioning
- Repartitioning
- How Spark calculates the partition count for transformations with shuffling (wide transformations)
- Partitioner
- Hash Partitioner
- Range Partitioner
- Custom Partitioner
- Advanced transformations
- mapPartitions
- mapPartitionsWithIndex
- mapPartitionsToPair
- mapValues
- flatMapValues
- repartitionAndSortWithinPartitions
- coalesce
- foldByKey
- aggregateByKey
- combineByKey
- Advanced actions
- Approximate actions
- Asynchronous actions
- Miscellaneous actions
- Shared variable
- Broadcast variable
- Properties of the broadcast variable
- Lifecycle of a broadcast variable
- Map-side join using broadcast variable
- Accumulators
- Driver program
- Summary
- Chapter 8: Working with Spark SQL
- SQLContext and HiveContext
- Initializing SparkSession
- Reading CSV using SparkSession
- Dataframe and dataset
- SchemaRDD
- Dataframe
- Dataset
- Creating a dataset using encoders
- Creating a dataset using StructType
- Unified dataframe and dataset API
- Data persistence
- Spark SQL operations
- Untyped dataset operation
- Temporary view
- Global temporary view
- Spark UDF
- Spark UDAF
- Untyped UDAF
- Type-safe UDAF
- Hive integration
- Table Persistence
- Summary
- Chapter 9: Near Real-Time Processing with Spark Streaming
- Introducing Spark Streaming
- Understanding micro batching
- Getting started with Spark Streaming jobs
- Streaming sources
- fileStream
- Kafka
- Streaming transformations
- Stateless transformation
- Stateful transformation
- Checkpointing
- Windowing
- Transform operation
- Fault tolerance and reliability
- Data receiver stage
- File streams
- Advanced streaming sources
- Transformation stage
- Output stage
- Structured Streaming
- Recap of the use case
- Structured streaming - programming model
- Built-in input sources and sinks
- Input sources
- Built-in Sinks
- Summary
- Chapter 10: Machine Learning Analytics with Spark MLlib
- Introduction to machine learning
- Concepts of machine learning
- Datatypes
- Machine learning work flow
- Pipelines
- Operations on feature vectors
- Feature extractors
- Feature transformers
- Feature selectors
- Summary
- Chapter 11: Learning Spark GraphX
- Introduction to GraphX
- Introduction to Property Graph
- Getting started with the GraphX API
- Using vertex and edge RDDs
- From edges
- EdgeTriplet
- Graph operations
- mapVertices
- mapEdges
- mapTriplets
- reverse
- subgraph
- aggregateMessages
- outerJoinVertices
- Graph algorithms
- PageRank
- Static PageRank
- Dynamic PageRank
- Triangle counting
- Connected components
- Summary
- Index