Spark in action
Other Authors: | |
---|---|
Format: | Electronic book |
Language: | English |
Published: | Shelter Island, New York : Manning, [2020] |
Subjects: | |
View at Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009633579706719 |
Table of Contents:
- Intro
- Copyright
- brief contents
- contents
- front matter
- foreword
- The analytics operating system
- preface
- acknowledgments
- about this book
- Who should read this book
- What will you learn in this book?
- How this book is organized
- About the code
- liveBook discussion forum
- about the author
- about the cover illustration
- Part 1. The theory crippled by awesome examples
- 1. So, what is Spark, anyway?
- 1.1 The big picture: What Spark is and what it does
- 1.1.1 What is Spark?
- 1.1.2 The four pillars of mana
- 1.2 How can you use Spark?
- 1.2.1 Spark in a data processing/engineering scenario
- 1.2.2 Spark in a data science scenario
- 1.3 What can you do with Spark?
- 1.3.1 Spark predicts restaurant quality at NC eateries
- 1.3.2 Spark allows fast data transfer for Lumeris
- 1.3.3 Spark analyzes equipment logs for CERN
- 1.3.4 Other use cases
- 1.4 Why you will love the dataframe
- 1.4.1 The dataframe from a Java perspective
- 1.4.2 The dataframe from an RDBMS perspective
- 1.4.3 A graphical representation of the dataframe
- 1.5 Your first example
- 1.5.1 Recommended software
- 1.5.2 Downloading the code
- 1.5.3 Running your first application
- Command line
- Eclipse
- 1.5.4 Your first code
- Summary
- 2. Architecture and flow
- 2.1 Building your mental model
- 2.2 Using Java code to build your mental model
- 2.3 Walking through your application
- 2.3.1 Connecting to a master
- 2.3.2 Loading, or ingesting, the CSV file
- 2.3.3 Transforming your data
- 2.3.4 Saving the work done in your dataframe to a database
- Summary
- 3. The majestic role of the dataframe
- 3.1 The essential role of the dataframe in Spark
- 3.1.1 Organization of a dataframe
- 3.1.2 Immutability is not a swear word
- 3.2 Using dataframes through examples
- 3.2.1 A dataframe after a simple CSV ingestion
- 3.2.2 Data is stored in partitions
- 3.2.3 Digging in the schema
- 3.2.4 A dataframe after a JSON ingestion
- 3.2.5 Combining two dataframes
- 3.3 The dataframe is a Dataset<Row>
- 3.3.1 Reusing your POJOs
- 3.3.2 Creating a dataset of strings
- 3.3.3 Converting back and forth
- Create the dataset
- Create the dataframe
- 3.4 Dataframe's ancestor: the RDD
- Summary
- 4. Fundamentally lazy
- 4.1 A real-life example of efficient laziness
- 4.2 A Spark example of efficient laziness
- 4.2.1 Looking at the results of transformations and actions
- 4.2.2 The transformation process, step by step
- 4.2.3 The code behind the transformation/action process
- 4.2.4 The mystery behind the creation of 7 million datapoints in 182 ms
- The mystery behind the timing of actions
- 4.3 Comparing to RDBMS and traditional applications
- 4.3.1 Working with the teen birth rates dataset
- 4.3.2 Analyzing differences between a traditional app and a Spark app
- 4.4 Spark is amazing for data-focused applications
- 4.5 Catalyst is your app catalyzer
- Summary
- 5. Building a simple app for deployment
- 5.1 An ingestionless example
- 5.1.1 Calculating π
- 5.1.2 The code to approximate π
- 5.1.3 What are lambda functions in Java?
- 5.1.4 Approximating π by using lambda functions
- 5.2 Interacting with Spark
- 5.2.1 Local mode
- 5.2.2 Cluster mode
- Submitting a job to Spark
- Setting the cluster's master in your application
- 5.2.3 Interactive mode in Scala and Python
- Scala shell
- Python shell
- Summary
- 6. Deploying your simple app
- 6.1 Beyond the example: The role of the components
- 6.1.1 Quick overview of the components and their interactions
- 6.1.2 Troubleshooting tips for the Spark architecture
- 6.1.3 Going further
- 6.2 Building a cluster
- 6.2.1 Building a cluster that works for you
- 6.2.2 Setting up the environment
- 6.3 Building your application to run on the cluster
- 6.3.1 Building your application's uber JAR
- 6.3.2 Building your application by using Git and Maven
- 6.4 Running your application on the cluster
- 6.4.1 Submitting the uber JAR
- 6.4.2 Running the application
- 6.4.3 The Spark user interface
- Summary
- Part 2. Ingestion
- 7. Ingestion from files
- 7.1 Common behaviors of parsers
- 7.2 Complex ingestion from CSV
- 7.2.1 Desired output
- 7.2.2 Code
- 7.3 Ingesting a CSV with a known schema
- 7.3.1 Desired output
- 7.3.2 Code
- 7.4 Ingesting a JSON file
- 7.4.1 Desired output
- 7.4.2 Code
- 7.5 Ingesting a multiline JSON file
- 7.5.1 Desired output
- 7.5.2 Code
- 7.6 Ingesting an XML file
- 7.6.1 Desired output
- 7.6.2 Code
- 7.7 Ingesting a text file
- 7.7.1 Desired output
- 7.7.2 Code
- 7.8 File formats for big data
- 7.8.1 The problem with traditional file formats
- 7.8.2 Avro is a schema-based serialization format
- 7.8.3 ORC is a columnar storage format
- 7.8.4 Parquet is also a columnar storage format
- 7.8.5 Comparing Avro, ORC, and Parquet
- 7.9 Ingesting Avro, ORC, and Parquet files
- 7.9.1 Ingesting Avro
- 7.9.2 Ingesting ORC
- 7.9.3 Ingesting Parquet
- 7.9.4 Reference table for ingesting Avro, ORC, or Parquet
- Summary
- 8. Ingestion from databases
- 8.1 Ingestion from relational databases
- 8.1.1 Database connection checklist
- 8.1.2 Understanding the data used in the examples
- 8.1.3 Desired output
- 8.1.4 Code
- 8.1.5 Alternative code
- 8.2 The role of the dialect
- 8.2.1 What is a dialect, anyway?
- 8.2.2 JDBC dialects provided with Spark
- 8.2.3 Building your own dialect
- 8.3 Advanced queries and ingestion
- 8.3.1 Filtering by using a WHERE clause
- 8.3.2 Joining data in the database
- 8.3.3 Performing ingestion and partitioning
- 8.3.4 Summary of advanced features
- 8.4 Ingestion from Elasticsearch
- 8.4.1 Data flow
- 8.4.2 The New York restaurants dataset digested by Spark
- 8.4.3 Code to ingest the restaurant dataset from Elasticsearch
- Summary
- 9. Advanced ingestion: Finding data sources and building your own
- 9.1 What is a data source?
- 9.2 Benefits of a direct connection to a data source
- 9.2.1 Temporary files
- 9.2.2 Data quality scripts
- 9.2.3 Data on demand
- 9.3 Finding data sources at Spark Packages
- 9.4 Building your own data source
- 9.4.1 Scope of the example project
- 9.4.2 Your data source API and options
- 9.5 Behind the scenes: Building the data source itself
- 9.6 Using the register file and the advertiser class
- 9.7 Understanding the relationship between the data and schema
- 9.7.1 The data source builds the relation
- 9.7.2 Inside the relation
- 9.8 Building the schema from a JavaBean
- 9.9 Building the dataframe is magic with the utilities
- 9.10 The other classes
- Summary
- 10. Ingestion through structured streaming
- 10.1 What's streaming?
- 10.2 Creating your first stream
- 10.2.1 Generating a file stream
- 10.2.2 Consuming the records
- 10.2.3 Getting records, not lines
- 10.3 Ingesting data from network streams
- 10.4 Dealing with multiple streams
- 10.5 Differentiating discretized and structured streaming
- Summary
- Part 3. Transforming your data
- 11. Working with SQL
- 11.1 Working with Spark SQL
- 11.2 The difference between local and global views
- 11.3 Mixing the dataframe API and Spark SQL
- 11.4 Don't DELETE it!
- 11.5 Going further with SQL
- Summary
- 12. Transforming your data
- 12.1 What is data transformation?
- 12.2 Process and example of record-level transformation
- 12.2.1 Data discovery to understand the complexity
- 12.2.2 Data mapping to draw the process
- 12.2.3 Writing the transformation code
- 12.2.4 Reviewing your data transformation to ensure a quality process
- What about sorting?
- Wrapping up your first Spark transformation
- 12.3 Joining datasets
- 12.3.1 A closer look at the datasets to join
- 12.3.2 Building the list of higher education institutions per county
- Initialization of Spark
- Loading and preparing the data
- 12.3.3 Performing the joins
- Joining the FIPS county identifier with the higher ed dataset using a join
- Joining the census data to get the county name
- 12.4 Performing more transformations
- Summary
- 13. Transforming entire documents
- 13.1 Transforming entire documents and their structure
- 13.1.1 Flattening your JSON document
- 13.1.2 Building nested documents for transfer and storage
- 13.2 The magic behind static functions
- 13.3 Performing more transformations
- Summary
- 14. Extending transformations with user-defined functions
- 14.1 Extending Apache Spark
- 14.2 Registering and calling a UDF
- 14.2.1 Registering the UDF with Spark
- 14.2.2 Using the UDF with the dataframe API
- 14.2.3 Manipulating UDFs with SQL
- 14.2.4 Implementing the UDF
- 14.2.5 Writing the service itself
- 14.3 Using UDFs to ensure a high level of data quality
- 14.4 Considering UDFs' constraints
- Summary
- 15. Aggregating your data
- 15.1 Aggregating data with Spark
- 15.1.1 A quick reminder on aggregations
- 15.1.2 Performing basic aggregations with Spark
- Performing an aggregation using the dataframe API
- Performing an aggregation using Spark SQL
- 15.2 Performing aggregations with live data
- 15.2.1 Preparing your dataset
- 15.2.2 Aggregating data to better understand the schools
- What is the average enrollment for each school?
- What is the evolution of the number of students?
- What is the highest enrollment per school and year?
- What is the minimal absenteeism per school?