In-Memory Analytics with Apache Arrow: Accelerate Data Analytics for Efficient Processing of Flat and Hierarchical Data Structures

Apache Arrow is an open source, columnar in-memory data format designed for efficient data processing and analytics. This book harnesses the author’s 15 years of experience to show you a standardized way to work with tabular data across various programming languages and environments, enabling high-performance…

Bibliographic Details
Other Authors: Topol, Matthew (author); McKinney, Wes (writer of foreword)
Format: eBook
Language: English
Published: Birmingham, England : Packt Publishing, [2024]
Edition: Second edition
See on Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009853372806719
Table of Contents:
  • Cover
  • Title Page
  • Copyright and Credits
  • Foreword
  • Contributors
  • Table of Contents
  • Preface
  • Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals
  • Chapter 1: Getting Started with Apache Arrow
  • Technical requirements
  • Understanding the Arrow format and specifications
  • Why does Arrow use a columnar in-memory format?
  • Learning the terminology and physical memory layout
  • Quick summary of physical layouts, or TL;DR
  • How to speak Arrow
  • Arrow format versioning and stability
  • Would you download a library? Of course!
  • Setting up your shooting range
  • Using PyArrow for Python
  • C++ for the 1337 coders
  • Go, Arrow, go!
  • Summary
  • References
  • Chapter 2: Working with Key Arrow Specifications
  • Technical requirements
  • Playing with data, wherever it might be!
  • Working with Arrow tables
  • Accessing data files with PyArrow
  • Accessing data files with Arrow in C++
  • Bears firing arrows
  • Putting pandas in your quiver
  • Making pandas run fast
  • Keeping pandas from running wild
  • Polar bears use Rust-y arrows
  • Sharing is caring… especially when it's your memory
  • Diving into memory management
  • Managing buffers for performance
  • Crossing boundaries
  • Summary
  • Chapter 3: Format and Memory Handling
  • Technical requirements
  • Storage versus runtime in-memory versus message-passing formats
  • Long-term storage formats
  • In-memory runtime formats
  • Message-passing formats
  • Summing up
  • Passing your Arrows around
  • What is this sorcery?!
  • Producing and consuming Arrows
  • Learning about memory cartography
  • The base case
  • Parquet versus CSV
  • Mapping data into memory
  • Too long; didn't read (TL;DR) - computers are magic
  • Leaving the CPU - using device memory
  • Starting with a few pointers
  • Device-agnostic buffer handling
  • Summary
  • Part 2: Interoperability with Arrow: The Power of Open Standards
  • Chapter 4: Crossing the Language Barrier with the Arrow C Data API
  • Technical requirements
  • Using the Arrow C data interface
  • The ArrowSchema structure
  • The ArrowArray structure
  • Example use cases
  • Using the C data API to export Arrow-formatted data
  • Importing Arrow data with Python
  • Exporting Arrow data with the C Data API from Python to Go
  • Streaming Arrow data between Python and Go
  • What about non-CPU device data?
  • The ArrowDeviceArray struct
  • Using ArrowDeviceArray
  • Other use cases
  • Some exercises
  • Summary
  • Chapter 5: Acero: A Streaming Arrow Execution Engine
  • Technical requirements
  • Letting Acero do the work for you
  • Input shaping
  • Value casting
  • Types of functions in Acero
  • Invoking functions
  • Using the C++ compute library
  • Using the compute library in Python
  • Picking the right tools
  • Adding a constant value to an array
  • Compute Add function
  • A simple for loop
  • Using std::for_each and reserve space
  • Divide and conquer
  • Always have a plan
  • Where does Acero fit?
  • Acero's core concepts
  • Let's get streaming!
  • Simplifying complexity
  • Summary
  • Chapter 6: Using the Arrow Datasets API
  • Technical requirements
  • Querying multifile datasets
  • Creating a sample dataset
  • Discovering dataset fragments
  • Filtering data programmatically
  • Expressing yourself - a quick detour
  • Using expressions for filtering data
  • Deriving and renaming columns (projecting)
  • Using the Datasets API in Python
  • Creating our sample dataset
  • Discovering the dataset
  • Using different file formats
  • Filtering and projecting columns with Python
  • Streaming results
  • Working with partitioned datasets
  • Writing partitioned data
  • Connecting everything together
  • Summary
  • Chapter 7: Exploring Apache Arrow Flight RPC
  • Technical requirements
  • The basics and complications of gRPC
  • Building modern APIs for data
  • Efficiency and streaming are important
  • Arrow Flight's building blocks
  • Horizontal scalability with Arrow Flight
  • Adding your business logic to Flight
  • Other bells and whistles
  • Understanding the Flight Protobuf definitions
  • Using Flight, choose your language!
  • Building a Python Flight server
  • Building a Go Flight server
  • What is Flight SQL?
  • Setting up a performance test
  • Everyone gets a containerized development environment!
  • Running the performance test
  • Flight SQL, the new kid on the block
  • Summary
  • Chapter 8: Understanding Arrow Database Connectivity (ADBC)
  • Technical requirements
  • ODBC takes an Arrow to the knee
  • Lost in translation
  • Arrow adoption in ODBC drivers
  • The benefits of standards around connectivity
  • The ADBC specification
  • ADBC databases
  • ADBC connections
  • ADBC statements
  • ADBC error handling
  • Using ADBC for performance and adaptability
  • ADBC with C/C++
  • Using ADBC with Python
  • Using ADBC with Go
  • Summary
  • Chapter 9: Using Arrow with Machine Learning Workflows
  • Technical requirements
  • SPARKing new ideas on Jupyter
  • Understanding the integration of Arrow in Spark
  • Containerization makes life easier
  • SPARKing joy with Arrow and PySpark
  • Facehuggers implanting data
  • Setting up your environment
  • Proving the benefits by checking resource usage
  • Using Arrow with the standard tools for ML
  • More GPU, more speed!
  • Summary
  • Part 3: Real-World Examples, Use Cases, and Future Development
  • Chapter 10: Powered by Apache Arrow
  • Swimming in data with Dremio Sonar
  • Clarifying Dremio Sonar's architecture
  • The library of the gods…of data analysis
  • Spicing up your data workflows
  • Arrow in the browser using JavaScript
  • Gaining a little perspective
  • Taking flight with Falcon
  • An Influx of connectivity
  • Summary
  • Chapter 11: How to Leave Your Mark on Arrow
  • Technical requirements
  • Contributing to open source projects
  • Communication is key
  • You don't necessarily have to contribute code
  • There are a lot of reasons why you should contribute!
  • Preparing your first pull request
  • Creating and navigating GitHub issues
  • Setting up Git
  • Orienting yourself in the code base
  • Building the Arrow libraries
  • Creating the pull request
  • Understanding Archery and the CI configuration
  • Find your interest and expand on it
  • Getting that sweet, sweet approval
  • Finishing up with style!
  • C++ code styling
  • Python code styling
  • Go code styling
  • Summary
  • Chapter 12: Future Development and Plans
  • Globetrotting with data - GeoArrow and GeoParquet
  • Collaboration breeds success
  • Expanding ADBC adoption
  • Final words
  • Index
  • Other Books You May Enjoy