Apache Arrow
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
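To make the idea of a columnar layout concrete, here is a minimal sketch in plain Python. It does not use Arrow's libraries or its actual memory format: the field names `id` and `price` are hypothetical, and the standard-library `array` and `memoryview` types stand in for Arrow's typed buffers. It only illustrates why storing each field contiguously suits analytic scans and allows slices to share memory instead of copying it.

```python
from array import array

# Row-oriented layout: one Python tuple per record.
rows = [(1, 10.5), (2, 20.0), (3, 30.25), (4, 40.0)]

# Column-oriented layout: one contiguous, typed buffer per field.
# (Hypothetical field names; an illustration, not Arrow's real format.)
columns = {
    "id": array("q", (r[0] for r in rows)),     # 64-bit signed integers
    "price": array("d", (r[1] for r in rows)),  # 64-bit floats
}

# An analytic operation reads only the column it needs.
total_price = sum(columns["price"])

# A memoryview exposes part of the column without copying its bytes.
price_view = memoryview(columns["price"])
last_two = price_view[2:]

# A write to the original buffer is visible through the view,
# which shows that both refer to the same memory.
columns["price"][3] = 45.0
shared = last_two[1] == 45.0
```

In a row-oriented layout the same sum would have to walk every record and skip the fields it does not need. The real Arrow format additionally defines validity bitmaps, offsets for variable-length data, and nested types, none of which this sketch models.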
The project is developing a multi-language collection of libraries for solving systems problems related to in-memory analytical data processing. This includes such topics as:
- Zero-copy shared memory and RPC-based data movement
- Reading and writing file formats (like CSV, Apache ORC, and Apache Parquet)
- In-memory analytics and query processing
- C++ Implementation
- Python bindings
- Installing PyArrow
- Memory and IO Interfaces
- Data Types and In-Memory Data Model
- Streaming, Serialization, and IPC
- File System Interfaces
- The Plasma In-Memory Object Store
- NumPy Integration
- Pandas Integration
- Timestamps
- Reading CSV files
- Reading and Writing the Apache Parquet Format
- CUDA Integration
- Using pyarrow from C++ and Cython Code
- API Reference
- Getting Involved
- Benchmarks