In-Memory Analytics with Apache Arrow: Accelerate Data Analytics for Efficient Processing of Flat and Hierarchical Data Structures
Apache Arrow is an open-source, columnar in-memory data format designed for efficient data processing and analytics. This book harnesses the author’s 15 years of experience to show you a standardized way to work with tabular data across various programming languages and environments, enabling high-performance...
Format: eBook
Language: English
Published: Birmingham, England: Packt Publishing, [2024]
Edition: Second edition
See on Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009853372806719
Table of Contents:
- Cover
- Title Page
- Copyright and Credits
- Foreword
- Contributors
- Table of Contents
- Preface
- Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals
- Chapter 1: Getting Started with Apache Arrow
- Technical requirements
- Understanding the Arrow format and specifications
- Why does Arrow use a columnar in-memory format?
- Learning the terminology and physical memory layout
- Quick summary of physical layouts, or TL;DR
- How to speak Arrow
- Arrow format versioning and stability
- Would you download a library? Of course!
- Setting up your shooting range
- Using PyArrow for Python
- C++ for the 1337 coders
- Go, Arrow, go!
- Summary
- References
- Chapter 2: Working with Key Arrow Specifications
- Technical requirements
- Playing with data, wherever it might be!
- Working with Arrow tables
- Accessing data files with PyArrow
- Accessing data files with Arrow in C++
- Bears firing arrows
- Putting pandas in your quiver
- Making pandas run fast
- Keeping pandas from running wild
- Polar bears use Rust-y arrows
- Sharing is caring… especially when it's your memory
- Diving into memory management
- Managing buffers for performance
- Crossing boundaries
- Summary
- Chapter 3: Format and Memory Handling
- Technical requirements
- Storage versus runtime in-memory versus message-passing formats
- Long-term storage formats
- In-memory runtime formats
- Message-passing formats
- Summing up
- Passing your Arrows around
- What is this sorcery?!
- Producing and consuming Arrows
- Learning about memory cartography
- The base case
- Parquet versus CSV
- Mapping data into memory
- Too long; didn't read (TL;DR) - computers are magic
- Leaving the CPU - using device memory
- Starting with a few pointers
- Device-agnostic buffer handling
- Summary
- Part 2: Interoperability with Arrow: The Power of Open Standards
- Chapter 4: Crossing the Language Barrier with the Arrow C Data API
- Technical requirements
- Using the Arrow C data interface
- The ArrowSchema structure
- The ArrowArray structure
- Example use cases
- Using the C data API to export Arrow-formatted data
- Importing Arrow data with Python
- Exporting Arrow data with the C Data API from Python to Go
- Streaming Arrow data between Python and Go
- What about non-CPU device data?
- The ArrowDeviceArray struct
- Using ArrowDeviceArray
- Other use cases
- Some exercises
- Summary
- Chapter 5: Acero: A Streaming Arrow Execution Engine
- Technical requirements
- Letting Acero do the work for you
- Input shaping
- Value casting
- Types of functions in Acero
- Invoking functions
- Using the C++ compute library
- Using the compute library in Python
- Picking the right tools
- Adding a constant value to an array
- Compute Add function
- A simple for loop
- Using std::for_each and reserve space
- Divide and conquer
- Always have a plan
- Where does Acero fit?
- Acero's core concepts
- Let's get streaming!
- Simplifying complexity
- Summary
- Chapter 6: Using the Arrow Datasets API
- Technical requirements
- Querying multifile datasets
- Creating a sample dataset
- Discovering dataset fragments
- Filtering data programmatically
- Expressing yourself - a quick detour
- Using expressions for filtering data
- Deriving and renaming columns (projecting)
- Using the Datasets API in Python
- Creating our sample dataset
- Discovering the dataset
- Using different file formats
- Filtering and projecting columns with Python
- Streaming results
- Working with partitioned datasets
- Writing partitioned data
- Connecting everything together
- Summary
- Chapter 7: Exploring Apache Arrow Flight RPC
- Technical requirements
- The basics and complications of gRPC
- Building modern APIs for data
- Efficiency and streaming are important
- Arrow Flight's building blocks
- Horizontal scalability with Arrow Flight
- Adding your business logic to Flight
- Other bells and whistles
- Understanding the Flight Protobuf definitions
- Using Flight, choose your language!
- Building a Python Flight server
- Building a Go Flight server
- What is Flight SQL?
- Setting up a performance test
- Everyone gets a containerized development environment!
- Running the performance test
- Flight SQL, the new kid on the block
- Summary
- Chapter 8: Understanding Arrow Database Connectivity (ADBC)
- Technical requirements
- ODBC takes an Arrow to the knee
- Lost in translation
- Arrow adoption in ODBC drivers
- The benefits of standards around connectivity
- The ADBC specification
- ADBC databases
- ADBC connections
- ADBC statements
- ADBC error handling
- Using ADBC for performance and adaptability
- ADBC with C/C++
- Using ADBC with Python
- Using ADBC with Go
- Summary
- Chapter 9: Using Arrow with Machine Learning Workflows
- Technical requirements
- SPARKing new ideas on Jupyter
- Understanding the integration of Arrow in Spark
- Containerization makes life easier
- SPARKing joy with Arrow and PySpark
- Facehuggers implanting data
- Setting up your environment
- Proving the benefits by checking resource usage
- Using Arrow with the standard tools for ML
- More GPU, more speed!
- Summary
- Part 3: Real-World Examples, Use Cases, and Future Development
- Chapter 10: Powered by Apache Arrow
- Swimming in data with Dremio Sonar
- Clarifying Dremio Sonar's architecture
- The library of the gods…of data analysis
- Spicing up your data workflows
- Arrow in the browser using JavaScript
- Gaining a little perspective
- Taking flight with Falcon
- An Influx of connectivity
- Summary
- Chapter 11: How to Leave Your Mark on Arrow
- Technical requirements
- Contributing to open source projects
- Communication is key
- You don't necessarily have to contribute code
- There are a lot of reasons why you should contribute!
- Preparing your first pull request
- Creating and navigating GitHub issues
- Setting up Git
- Orienting yourself in the code base
- Building the Arrow libraries
- Creating the pull request
- Understanding Archery and the CI configuration
- Find your interest and expand on it
- Getting that sweet, sweet approval
- Finishing up with style!
- C++ code styling
- Python code styling
- Go code styling
- Summary
- Chapter 12: Future Development and Plans
- Globetrotting with data - GeoArrow and GeoParquet
- Collaboration breeds success
- Expanding ADBC adoption
- Final words
- Index
- Other Books You May Enjoy