Spark in action

Bibliographic Details
Other Authors: Perrin, Jean-Georges, 1971- (author); Thomas, Rob (Information technology executive) (writer of foreword)
Format: Electronic book
Language: English
Published: Shelter Island, New York: Manning, [2020]
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009633579706719
Table of Contents:
  • Intro
  • Copyright
  • brief contents
  • contents
  • front matter
  • foreword
  • The analytics operating system
  • preface
  • acknowledgments
  • about this book
  • Who should read this book
  • What will you learn in this book?
  • How this book is organized
  • About the code
  • liveBook discussion forum
  • about the author
  • about the cover illustration
  • Part 1. The theory crippled by awesome examples
  • 1. So, what is Spark, anyway?
  • 1.1 The big picture: What Spark is and what it does
  • 1.1.1 What is Spark?
  • 1.1.2 The four pillars of mana
  • 1.2 How can you use Spark?
  • 1.2.1 Spark in a data processing/engineering scenario
  • 1.2.2 Spark in a data science scenario
  • 1.3 What can you do with Spark?
  • 1.3.1 Spark predicts restaurant quality at NC eateries
  • 1.3.2 Spark allows fast data transfer for Lumeris
  • 1.3.3 Spark analyzes equipment logs for CERN
  • 1.3.4 Other use cases
  • 1.4 Why you will love the dataframe
  • 1.4.1 The dataframe from a Java perspective
  • 1.4.2 The dataframe from an RDBMS perspective
  • 1.4.3 A graphical representation of the dataframe
  • 1.5 Your first example
  • 1.5.1 Recommended software
  • 1.5.2 Downloading the code
  • 1.5.3 Running your first application
  • Command line
  • Eclipse
  • 1.5.4 Your first code
  • Summary
  • 2. Architecture and flow
  • 2.1 Building your mental model
  • 2.2 Using Java code to build your mental model
  • 2.3 Walking through your application
  • 2.3.1 Connecting to a master
  • 2.3.2 Loading, or ingesting, the CSV file
  • 2.3.3 Transforming your data
  • 2.3.4 Saving the work done in your dataframe to a database
  • Summary
  • 3. The majestic role of the dataframe
  • 3.1 The essential role of the dataframe in Spark
  • 3.1.1 Organization of a dataframe
  • 3.1.2 Immutability is not a swear word
  • 3.2 Using dataframes through examples
  • 3.2.1 A dataframe after a simple CSV ingestion
  • 3.2.2 Data is stored in partitions
  • 3.2.3 Digging in the schema
  • 3.2.4 A dataframe after a JSON ingestion
  • 3.2.5 Combining two dataframes
  • 3.3 The dataframe is a Dataset&lt;Row&gt;
  • 3.3.1 Reusing your POJOs
  • 3.3.2 Creating a dataset of strings
  • 3.3.3 Converting back and forth
  • Create the dataset
  • Create the dataframe
  • 3.4 Dataframe's ancestor: the RDD
  • Summary
  • 4. Fundamentally lazy
  • 4.1 A real-life example of efficient laziness
  • 4.2 A Spark example of efficient laziness
  • 4.2.1 Looking at the results of transformations and actions
  • 4.2.2 The transformation process, step by step
  • 4.2.3 The code behind the transformation/action process
  • 4.2.4 The mystery behind the creation of 7 million datapoints in 182 ms
  • The mystery behind the timing of actions
  • 4.3 Comparing to RDBMS and traditional applications
  • 4.3.1 Working with the teen birth rates dataset
  • 4.3.2 Analyzing differences between a traditional app and a Spark app
  • 4.4 Spark is amazing for data-focused applications
  • 4.5 Catalyst is your app catalyzer
  • Summary
  • 5. Building a simple app for deployment
  • 5.1 An ingestionless example
  • 5.1.1 Calculating π
  • 5.1.2 The code to approximate π
  • 5.1.3 What are lambda functions in Java?
  • 5.1.4 Approximating π by using lambda functions
  • 5.2 Interacting with Spark
  • 5.2.1 Local mode
  • 5.2.2 Cluster mode
  • Submitting a job to Spark
  • Setting the cluster's master in your application
  • 5.2.3 Interactive mode in Scala and Python
  • Scala shell
  • Python shell
  • Summary
  • 6. Deploying your simple app
  • 6.1 Beyond the example: The role of the components
  • 6.1.1 Quick overview of the components and their interactions
  • 6.1.2 Troubleshooting tips for the Spark architecture
  • 6.1.3 Going further
  • 6.2 Building a cluster
  • 6.2.1 Building a cluster that works for you
  • 6.2.2 Setting up the environment
  • 6.3 Building your application to run on the cluster
  • 6.3.1 Building your application's uber JAR
  • 6.3.2 Building your application by using Git and Maven
  • 6.4 Running your application on the cluster
  • 6.4.1 Submitting the uber JAR
  • 6.4.2 Running the application
  • 6.4.3 The Spark user interface
  • Summary
  • Part 2. Ingestion
  • 7. Ingestion from files
  • 7.1 Common behaviors of parsers
  • 7.2 Complex ingestion from CSV
  • 7.2.1 Desired output
  • 7.2.2 Code
  • 7.3 Ingesting a CSV with a known schema
  • 7.3.1 Desired output
  • 7.3.2 Code
  • 7.4 Ingesting a JSON file
  • 7.4.1 Desired output
  • 7.4.2 Code
  • 7.5 Ingesting a multiline JSON file
  • 7.5.1 Desired output
  • 7.5.2 Code
  • 7.6 Ingesting an XML file
  • 7.6.1 Desired output
  • 7.6.2 Code
  • 7.7 Ingesting a text file
  • 7.7.1 Desired output
  • 7.7.2 Code
  • 7.8 File formats for big data
  • 7.8.1 The problem with traditional file formats
  • 7.8.2 Avro is a schema-based serialization format
  • 7.8.3 ORC is a columnar storage format
  • 7.8.4 Parquet is also a columnar storage format
  • 7.8.5 Comparing Avro, ORC, and Parquet
  • 7.9 Ingesting Avro, ORC, and Parquet files
  • 7.9.1 Ingesting Avro
  • 7.9.2 Ingesting ORC
  • 7.9.3 Ingesting Parquet
  • 7.9.4 Reference table for ingesting Avro, ORC, or Parquet
  • Summary
  • 8. Ingestion from databases
  • 8.1 Ingestion from relational databases
  • 8.1.1 Database connection checklist
  • 8.1.2 Understanding the data used in the examples
  • 8.1.3 Desired output
  • 8.1.4 Code
  • 8.1.5 Alternative code
  • 8.2 The role of the dialect
  • 8.2.1 What is a dialect, anyway?
  • 8.2.2 JDBC dialects provided with Spark
  • 8.2.3 Building your own dialect
  • 8.3 Advanced queries and ingestion
  • 8.3.1 Filtering by using a WHERE clause
  • 8.3.2 Joining data in the database
  • 8.3.3 Performing ingestion and partitioning
  • 8.3.4 Summary of advanced features
  • 8.4 Ingestion from Elasticsearch
  • 8.4.1 Data flow
  • 8.4.2 The New York restaurants dataset digested by Spark
  • 8.4.3 Code to ingest the restaurant dataset from Elasticsearch
  • Summary
  • 9. Advanced ingestion: finding data sources and building your own
  • 9.1 What is a data source?
  • 9.2 Benefits of a direct connection to a data source
  • 9.2.1 Temporary files
  • 9.2.2 Data quality scripts
  • 9.2.3 Data on demand
  • 9.3 Finding data sources at Spark Packages
  • 9.4 Building your own data source
  • 9.4.1 Scope of the example project
  • 9.4.2 Your data source API and options
  • 9.5 Behind the scenes: Building the data source itself
  • 9.6 Using the register file and the advertiser class
  • 9.7 Understanding the relationship between the data and schema
  • 9.7.1 The data source builds the relation
  • 9.7.2 Inside the relation
  • 9.8 Building the schema from a JavaBean
  • 9.9 Building the dataframe is magic with the utilities
  • 9.10 The other classes
  • Summary
  • 10. Ingestion through structured streaming
  • 10.1 What's streaming?
  • 10.2 Creating your first stream
  • 10.2.1 Generating a file stream
  • 10.2.2 Consuming the records
  • 10.2.3 Getting records, not lines
  • 10.3 Ingesting data from network streams
  • 10.4 Dealing with multiple streams
  • 10.5 Differentiating discretized and structured streaming
  • Summary
  • Part 3. Transforming your data
  • 11. Working with SQL
  • 11.1 Working with Spark SQL
  • 11.2 The difference between local and global views
  • 11.3 Mixing the dataframe API and Spark SQL
  • 11.4 Don't DELETE it!
  • 11.5 Going further with SQL
  • Summary
  • 12. Transforming your data
  • 12.1 What is data transformation?
  • 12.2 Process and example of record-level transformation
  • 12.2.1 Data discovery to understand the complexity
  • 12.2.2 Data mapping to draw the process
  • 12.2.3 Writing the transformation code
  • 12.2.4 Reviewing your data transformation to ensure a quality process
  • What about sorting?
  • Wrapping up your first Spark transformation
  • 12.3 Joining datasets
  • 12.3.1 A closer look at the datasets to join
  • 12.3.2 Building the list of higher education institutions per county
  • Initialization of Spark
  • Loading and preparing the data
  • 12.3.3 Performing the joins
  • Joining the FIPS county identifier with the higher ed dataset using a join
  • Joining the census data to get the county name
  • 12.4 Performing more transformations
  • Summary
  • 13. Transforming entire documents
  • 13.1 Transforming entire documents and their structure
  • 13.1.1 Flattening your JSON document
  • 13.1.2 Building nested documents for transfer and storage
  • 13.2 The magic behind static functions
  • 13.3 Performing more transformations
  • Summary
  • 14. Extending transformations with user-defined functions
  • 14.1 Extending Apache Spark
  • 14.2 Registering and calling a UDF
  • 14.2.1 Registering the UDF with Spark
  • 14.2.2 Using the UDF with the dataframe API
  • 14.2.3 Manipulating UDFs with SQL
  • 14.2.4 Implementing the UDF
  • 14.2.5 Writing the service itself
  • 14.3 Using UDFs to ensure a high level of data quality
  • 14.4 Considering UDFs' constraints
  • Summary
  • 15. Aggregating your data
  • 15.1 Aggregating data with Spark
  • 15.1.1 A quick reminder on aggregations
  • 15.1.2 Performing basic aggregations with Spark
  • Performing an aggregation using the dataframe API
  • Performing an aggregation using Spark SQL
  • 15.2 Performing aggregations with live data
  • 15.2.1 Preparing your dataset
  • 15.2.2 Aggregating data to better understand the schools
  • What is the average enrollment for each school?
  • What is the evolution of the number of students?
  • What is the highest enrollment per school and year?
  • What is the minimum absenteeism per school?