Arrow specification documents
Currently, the Arrow specification consists of these pieces:
- Metadata specification (see Metadata: Logical types, schemas, data headers)
- Physical memory layout specification (see Physical memory layout)
- Logical Types, Schemas, and Record Batch Metadata (see Schema.fbs)
- Encapsulated Messages (see Message.fbs)
- Mechanics of messaging between Arrow systems (IPC, RPC, etc.) (see Interprocess messaging / communication (IPC))
- Tensor (Multi-dimensional array) Metadata (see Tensor.fbs and SparseTensor.fbs)
The metadata currently uses Google’s Flatbuffers library to serialize two related pieces of information:
- Schemas for tables or record (row) batches. A schema contains the logical types, field names, and other metadata. Schemas do not contain any information about the actual data.
- Data headers for record (row) batches. These must correspond to a known schema, and enable a system to send and receive Arrow row batches in a form that can be precisely disassembled or reconstructed.
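The split between a schema and a data header can be illustrated with a small, self-contained sketch. It is not the Flatbuffers encoding and not any Arrow library API: the names `Field`, `Schema`, `BufferSpec`, `BatchHeader`, `write_batch`, and `read_batch` are hypothetical, the type labels `"int32"` and `"float64"` are assumptions chosen for illustration, and details of the real format such as validity bitmaps and buffer padding are left out. It only shows that a header (row count plus buffer offsets and lengths) is enough to reassemble the columns once the matching schema is known.

```python
import struct
from dataclasses import dataclass

# Hypothetical mapping from a logical type label to a struct format code.
_FORMATS = {"int32": "i", "float64": "d"}


@dataclass(frozen=True)
class Field:
    name: str
    logical_type: str  # key into _FORMATS (illustrative labels only)


@dataclass(frozen=True)
class Schema:
    fields: tuple  # tuple of Field; carries no row data at all


@dataclass(frozen=True)
class BufferSpec:
    offset: int  # byte offset of one column's values inside the body
    length: int  # byte length of that column's values


@dataclass(frozen=True)
class BatchHeader:
    num_rows: int
    buffers: tuple  # one BufferSpec per schema field, in schema order


def write_batch(schema, columns):
    """Pack the columns into one body; return (header, body)."""
    num_rows = len(columns[0])
    body = bytearray()
    buffers = []
    for fld, values in zip(schema.fields, columns):
        if len(values) != num_rows:
            raise ValueError("all columns must have the same length")
        # "<" means little-endian with no implicit alignment padding.
        fmt = f"<{num_rows}{_FORMATS[fld.logical_type]}"
        raw = struct.pack(fmt, *values)
        buffers.append(BufferSpec(offset=len(body), length=len(raw)))
        body += raw
    return BatchHeader(num_rows, tuple(buffers)), bytes(body)


def read_batch(schema, header, body):
    """Rebuild the columns from a body using the schema and the header."""
    if len(header.buffers) != len(schema.fields):
        raise ValueError("header does not correspond to this schema")
    columns = []
    for fld, buf in zip(schema.fields, header.buffers):
        raw = body[buf.offset:buf.offset + buf.length]
        fmt = f"<{header.num_rows}{_FORMATS[fld.logical_type]}"
        columns.append(list(struct.unpack(fmt, raw)))
    return columns
```

In this sketch the schema can be sent once and reused for many batches, since each `BatchHeader` describes only where the bytes for one batch live, not what they mean.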
Arrow Format Maturity and Stability
We have made significant progress hardening the Arrow in-memory format and Flatbuffers metadata since the project started in February 2016. For example, we have integration tests that verify binary compatibility between the Java and C++ implementations.
Major versions may still include breaking changes to the memory format or metadata, so we recommend using the same released version of all libraries in your applications for maximum compatibility. Data stored in the Arrow IPC formats should not be used for long-term storage.