In-Memory Analytics with Apache Arrow: Accelerate Data Analytics for Efficient Processing of Flat and Hierarchical Data Structures
Apache Arrow is an open-source, columnar in-memory data format designed for efficient data processing and analytics. This book harnesses the author’s 15 years of experience to show you a standardized way to work with tabular data across various programming languages and environments, enabling high-performance...
Format: eBook
Language: English
Published: Birmingham, England: Packt Publishing, [2024]
Edition: Second edition
See on Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009853372806719
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Foreword
- Contributors
- Table of Contents
- Preface
- Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals
- Chapter 1: Getting Started with Apache Arrow
- Technical requirements
- Understanding the Arrow format and specifications
- Why does Arrow use a columnar in-memory format?
- Learning the terminology and physical memory layout
- Quick summary of physical layouts, or TL;DR
- How to speak Arrow
- Arrow format versioning and stability
- Would you download a library? Of course!
- Setting up your shooting range
- Using PyArrow for Python
- C++ for the 1337 coders
- Go, Arrow, go!
- Summary
- References
- Chapter 2: Working with Key Arrow Specifications
- Technical requirements
- Playing with data, wherever it might be!
- Working with Arrow tables
- Accessing data files with PyArrow
- Accessing data files with Arrow in C++
- Bears firing arrows
- Putting pandas in your quiver
- Making pandas run fast
- Keeping pandas from running wild
- Polar bears use Rust-y arrows
- Sharing is caring… especially when it's your memory
- Diving into memory management
- Managing buffers for performance
- Crossing boundaries
- Summary
- Chapter 3: Format and Memory Handling
- Technical requirements
- Storage versus runtime in-memory versus message-passing formats
- Long-term storage formats
- In-memory runtime formats
- Message-passing formats
- Summing up
- Passing your Arrows around
- What is this sorcery?!
- Producing and consuming Arrows
- Learning about memory cartography
- The base case
- Parquet versus CSV
- Mapping data into memory
- Too long; didn't read (TL;DR) - computers are magic
- Leaving the CPU - using device memory
- Starting with a few pointers
- Device-agnostic buffer handling
- Summary
- Part 2: Interoperability with Arrow: The Power of Open Standards
- Chapter 4: Crossing the Language Barrier with the Arrow C Data API
- Technical requirements
- Using the Arrow C data interface
- The ArrowSchema structure
- The ArrowArray structure
- Example use cases
- Using the C data API to export Arrow-formatted data
- Importing Arrow data with Python
- Exporting Arrow data with the C Data API from Python to Go
- Streaming Arrow data between Python and Go
- What about non-CPU device data?
- The ArrowDeviceArray struct
- Using ArrowDeviceArray
- Other use cases
- Some exercises
- Summary
- Chapter 5: Acero: A Streaming Arrow Execution Engine
- Technical requirements
- Letting Acero do the work for you
- Input shaping
- Value casting
- Types of functions in Acero
- Invoking functions
- Using the C++ compute library
- Using the compute library in Python
- Picking the right tools
- Adding a constant value to an array
- Compute Add function
- A simple for loop
- Using std::for_each and reserve space
- Divide and conquer
- Always have a plan
- Where does Acero fit?
- Acero's core concepts
- Let's get streaming!
- Simplifying complexity
- Summary
- Chapter 6: Using the Arrow Datasets API
- Technical requirements
- Querying multifile datasets
- Creating a sample dataset
- Discovering dataset fragments
- Filtering data programmatically
- Expressing yourself - a quick detour
- Using expressions for filtering data
- Deriving and renaming columns (projecting)
- Using the Datasets API in Python
- Creating our sample dataset
- Discovering the dataset
- Using different file formats
- Filtering and projecting columns with Python
- Streaming results
- Working with partitioned datasets
- Writing partitioned data
- Connecting everything together
- Summary
- Chapter 7: Exploring Apache Arrow Flight RPC
- Technical requirements
- The basics and complications of gRPC
- Building modern APIs for data
- Efficiency and streaming are important
- Arrow Flight's building blocks
- Horizontal scalability with Arrow Flight
- Adding your business logic to Flight
- Other bells and whistles
- Understanding the Flight Protobuf definitions
- Using Flight, choose your language!
- Building a Python Flight server
- Building a Go Flight server
- What is Flight SQL?
- Setting up a performance test
- Everyone gets a containerized development environment!
- Running the performance test
- Flight SQL, the new kid on the block
- Summary
- Chapter 8: Understanding Arrow Database Connectivity (ADBC)
- Technical requirements
- ODBC takes an Arrow to the knee
- Lost in translation
- Arrow adoption in ODBC drivers
- The benefits of standards around connectivity
- The ADBC specification
- ADBC databases
- ADBC connections
- ADBC statements
- ADBC error handling
- Using ADBC for performance and adaptability
- ADBC with C/C++
- Using ADBC with Python
- Using ADBC with Go
- Summary
- Chapter 9: Using Arrow with Machine Learning Workflows
- Technical requirements
- SPARKing new ideas on Jupyter
- Understanding the integration of Arrow in Spark
- Containerization makes life easier
- SPARKing joy with Arrow and PySpark
- Facehuggers implanting data
- Setting up your environment
- Proving the benefits by checking resource usage
- Using Arrow with the standard tools for ML
- More GPU, more speed!
- Summary
- Part 3: Real-World Examples, Use Cases, and Future Development
- Chapter 10: Powered by Apache Arrow
- Swimming in data with Dremio Sonar
- Clarifying Dremio Sonar's architecture
- The library of the gods…of data analysis
- Spicing up your data workflows
- Arrow in the browser using JavaScript
- Gaining a little perspective
- Taking flight with Falcon
- An Influx of connectivity
- Summary
- Chapter 11: How to Leave Your Mark on Arrow
- Technical requirements
- Contributing to open source projects
- Communication is key
- You don't necessarily have to contribute code
- There are a lot of reasons why you should contribute!
- Preparing your first pull request
- Creating and navigating GitHub issues
- Setting up Git
- Orienting yourself in the code base
- Building the Arrow libraries
- Creating the pull request
- Understanding Archery and the CI configuration
- Find your interest and expand on it
- Getting that sweet, sweet approval
- Finishing up with style!
- C++ code styling
- Python code styling
- Go code styling
- Summary
- Chapter 12: Future Development and Plans
- Globetrotting with data - GeoArrow and GeoParquet
- Collaboration breeds success
- Expanding ADBC adoption
- Final words
- Index
- Other Books You May Enjoy