Spark in action

Bibliographic Details
Other Authors: Perrin, Jean-Georges, 1971- (author); Thomas, Rob (Information technology executive) (writer of foreword)
Format: Electronic book
Language: English
Published: Shelter Island, New York: Manning, [2020]
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009633579706719
Table of Contents:
  • Intro
  • Copyright
  • brief contents
  • contents
  • front matter
  • foreword
  • The analytics operating system
  • preface
  • acknowledgments
  • about this book
  • Who should read this book
  • What will you learn in this book?
  • How this book is organized
  • About the code
  • liveBook discussion forum
  • about the author
  • about the cover illustration
  • Part 1. The theory crippled by awesome examples
  • 1. So, what is Spark, anyway?
  • 1.1 The big picture: What Spark is and what it does
  • 1.1.1 What is Spark?
  • 1.1.2 The four pillars of mana
  • 1.2 How can you use Spark?
  • 1.2.1 Spark in a data processing/engineering scenario
  • 1.2.2 Spark in a data science scenario
  • 1.3 What can you do with Spark?
  • 1.3.1 Spark predicts restaurant quality at NC eateries
  • 1.3.2 Spark allows fast data transfer for Lumeris
  • 1.3.3 Spark analyzes equipment logs for CERN
  • 1.3.4 Other use cases
  • 1.4 Why you will love the dataframe
  • 1.4.1 The dataframe from a Java perspective
  • 1.4.2 The dataframe from an RDBMS perspective
  • 1.4.3 A graphical representation of the dataframe
  • 1.5 Your first example
  • 1.5.1 Recommended software
  • 1.5.2 Downloading the code
  • 1.5.3 Running your first application
  • Command line
  • Eclipse
  • 1.5.4 Your first code
  • Summary
  • 2. Architecture and flow
  • 2.1 Building your mental model
  • 2.2 Using Java code to build your mental model
  • 2.3 Walking through your application
  • 2.3.1 Connecting to a master
  • 2.3.2 Loading, or ingesting, the CSV file
  • 2.3.3 Transforming your data
  • 2.3.4 Saving the work done in your dataframe to a database
  • Summary
  • 3. The majestic role of the dataframe
  • 3.1 The essential role of the dataframe in Spark
  • 3.1.1 Organization of a dataframe
  • 3.1.2 Immutability is not a swear word
  • 3.2 Using dataframes through examples
  • 3.2.1 A dataframe after a simple CSV ingestion
  • 3.2.2 Data is stored in partitions
  • 3.2.3 Digging in the schema
  • 3.2.4 A dataframe after a JSON ingestion
  • 3.2.5 Combining two dataframes
  • 3.3 The dataframe is a Dataset&lt;Row&gt;
  • 3.3.1 Reusing your POJOs
  • 3.3.2 Creating a dataset of strings
  • 3.3.3 Converting back and forth
  • Create the dataset
  • Create the dataframe
  • 3.4 Dataframe's ancestor: the RDD
  • Summary
  • 4. Fundamentally lazy
  • 4.1 A real-life example of efficient laziness
  • 4.2 A Spark example of efficient laziness
  • 4.2.1 Looking at the results of transformations and actions
  • 4.2.2 The transformation process, step by step
  • 4.2.3 The code behind the transformation/action process
  • 4.2.4 The mystery behind the creation of 7 million datapoints in 182 ms
  • The mystery behind the timing of actions
  • 4.3 Comparing to RDBMS and traditional applications
  • 4.3.1 Working with the teen birth rates dataset
  • 4.3.2 Analyzing differences between a traditional app and a Spark app
  • 4.4 Spark is amazing for data-focused applications
  • 4.5 Catalyst is your app catalyzer
  • Summary
  • 5. Building a simple app for deployment
  • 5.1 An ingestionless example
  • 5.1.1 Calculating π
  • 5.1.2 The code to approximate π
  • 5.1.3 What are lambda functions in Java?
  • 5.1.4 Approximating π by using lambda functions
  • 5.2 Interacting with Spark
  • 5.2.1 Local mode
  • 5.2.2 Cluster mode
  • Submitting a job to Spark
  • Setting the cluster's master in your application
  • 5.2.3 Interactive mode in Scala and Python
  • Scala shell
  • Python shell
  • Summary
  • 6. Deploying your simple app
  • 6.1 Beyond the example: The role of the components
  • 6.1.1 Quick overview of the components and their interactions
  • 6.1.2 Troubleshooting tips for the Spark architecture
  • 6.1.3 Going further
  • 6.2 Building a cluster
  • 6.2.1 Building a cluster that works for you
  • 6.2.2 Setting up the environment
  • 6.3 Building your application to run on the cluster
  • 6.3.1 Building your application's uber JAR
  • 6.3.2 Building your application by using Git and Maven
  • 6.4 Running your application on the cluster
  • 6.4.1 Submitting the uber JAR
  • 6.4.2 Running the application
  • 6.4.3 The Spark user interface
  • Summary
  • Part 2. Ingestion
  • 7. Ingestion from files
  • 7.1 Common behaviors of parsers
  • 7.2 Complex ingestion from CSV
  • 7.2.1 Desired output
  • 7.2.2 Code
  • 7.3 Ingesting a CSV with a known schema
  • 7.3.1 Desired output
  • 7.3.2 Code
  • 7.4 Ingesting a JSON file
  • 7.4.1 Desired output
  • 7.4.2 Code
  • 7.5 Ingesting a multiline JSON file
  • 7.5.1 Desired output
  • 7.5.2 Code
  • 7.6 Ingesting an XML file
  • 7.6.1 Desired output
  • 7.6.2 Code
  • 7.7 Ingesting a text file
  • 7.7.1 Desired output
  • 7.7.2 Code
  • 7.8 File formats for big data
  • 7.8.1 The problem with traditional file formats
  • 7.8.2 Avro is a schema-based serialization format
  • 7.8.3 ORC is a columnar storage format
  • 7.8.4 Parquet is also a columnar storage format
  • 7.8.5 Comparing Avro, ORC, and Parquet
  • 7.9 Ingesting Avro, ORC, and Parquet files
  • 7.9.1 Ingesting Avro
  • 7.9.2 Ingesting ORC
  • 7.9.3 Ingesting Parquet
  • 7.9.4 Reference table for ingesting Avro, ORC, or Parquet
  • Summary
  • 8. Ingestion from databases
  • 8.1 Ingestion from relational databases
  • 8.1.1 Database connection checklist
  • 8.1.2 Understanding the data used in the examples
  • 8.1.3 Desired output
  • 8.1.4 Code
  • 8.1.5 Alternative code
  • 8.2 The role of the dialect
  • 8.2.1 What is a dialect, anyway?
  • 8.2.2 JDBC dialects provided with Spark
  • 8.2.3 Building your own dialect
  • 8.3 Advanced queries and ingestion
  • 8.3.1 Filtering by using a WHERE clause
  • 8.3.2 Joining data in the database
  • 8.3.3 Performing ingestion and partitioning
  • 8.3.4 Summary of advanced features
  • 8.4 Ingestion from Elasticsearch
  • 8.4.1 Data flow
  • 8.4.2 The New York restaurants dataset digested by Spark
  • 8.4.3 Code to ingest the restaurant dataset from Elasticsearch
  • Summary
  • 9. Advanced ingestion: finding data sources and building your own
  • 9.1 What is a data source?
  • 9.2 Benefits of a direct connection to a data source
  • 9.2.1 Temporary files
  • 9.2.2 Data quality scripts
  • 9.2.3 Data on demand
  • 9.3 Finding data sources at Spark Packages
  • 9.4 Building your own data source
  • 9.4.1 Scope of the example project
  • 9.4.2 Your data source API and options
  • 9.5 Behind the scenes: Building the data source itself
  • 9.6 Using the register file and the advertiser class
  • 9.7 Understanding the relationship between the data and schema
  • 9.7.1 The data source builds the relation
  • 9.7.2 Inside the relation
  • 9.8 Building the schema from a JavaBean
  • 9.9 Building the dataframe is magic with the utilities
  • 9.10 The other classes
  • Summary
  • 10. Ingestion through structured streaming
  • 10.1 What's streaming?
  • 10.2 Creating your first stream
  • 10.2.1 Generating a file stream
  • 10.2.2 Consuming the records
  • 10.2.3 Getting records, not lines
  • 10.3 Ingesting data from network streams
  • 10.4 Dealing with multiple streams
  • 10.5 Differentiating discretized and structured streaming
  • Summary
  • Part 3. Transforming your data
  • 11. Working with SQL
  • 11.1 Working with Spark SQL
  • 11.2 The difference between local and global views
  • 11.3 Mixing the dataframe API and Spark SQL
  • 11.4 Don't DELETE it!
  • 11.5 Going further with SQL
  • Summary
  • 12. Transforming your data
  • 12.1 What is data transformation?
  • 12.2 Process and example of record-level transformation
  • 12.2.1 Data discovery to understand the complexity
  • 12.2.2 Data mapping to draw the process
  • 12.2.3 Writing the transformation code
  • 12.2.4 Reviewing your data transformation to ensure a quality process
  • What about sorting?
  • Wrapping up your first Spark transformation
  • 12.3 Joining datasets
  • 12.3.1 A closer look at the datasets to join
  • 12.3.2 Building the list of higher education institutions per county
  • Initialization of Spark
  • Loading and preparing the data
  • 12.3.3 Performing the joins
  • Joining the FIPS county identifier with the higher ed dataset using a join
  • Joining the census data to get the county name
  • 12.4 Performing more transformations
  • Summary
  • 13. Transforming entire documents
  • 13.1 Transforming entire documents and their structure
  • 13.1.1 Flattening your JSON document
  • 13.1.2 Building nested documents for transfer and storage
  • 13.2 The magic behind static functions
  • 13.3 Performing more transformations
  • Summary
  • 14. Extending transformations with user-defined functions
  • 14.1 Extending Apache Spark
  • 14.2 Registering and calling a UDF
  • 14.2.1 Registering the UDF with Spark
  • 14.2.2 Using the UDF with the dataframe API
  • 14.2.3 Manipulating UDFs with SQL
  • 14.2.4 Implementing the UDF
  • 14.2.5 Writing the service itself
  • 14.3 Using UDFs to ensure a high level of data quality
  • 14.4 Considering UDFs' constraints
  • Summary
  • 15. Aggregating your data
  • 15.1 Aggregating data with Spark
  • 15.1.1 A quick reminder on aggregations
  • 15.1.2 Performing basic aggregations with Spark
  • Performing an aggregation using the dataframe API
  • Performing an aggregation using Spark SQL
  • 15.2 Performing aggregations with live data
  • 15.2.1 Preparing your dataset
  • 15.2.2 Aggregating data to better understand the schools
  • What is the average enrollment for each school?
  • What is the evolution of the number of students?
  • What is the highest enrollment per school and year?
  • What is the minimum absenteeism per school?