Data Types and In-Memory Data Model
Apache Arrow defines columnar array data structures by composing type metadata with memory buffers, like the ones explained in the documentation on Memory and IO. These data structures are exposed in Python through a series of interrelated classes:
- Type Metadata: Instances of pyarrow.DataType, which describe a logical array type
- Schemas: Instances of pyarrow.Schema, which describe a named collection of types. These can be thought of as the column types in a table-like object.
- Arrays: Instances of pyarrow.Array, which are atomic, contiguous columnar data structures composed from Arrow Buffer objects
- Record Batches: Instances of pyarrow.RecordBatch, which are a collection of Array objects with a particular Schema
- Tables: Instances of pyarrow.Table, a logical table data structure in which each column consists of one or more pyarrow.Array objects of the same type.
We will examine these in the sections below in a series of examples.
Type Metadata
Apache Arrow defines language-agnostic, column-oriented data structures for array data. These include:
- Fixed-length primitive types: numbers, booleans, dates and times, fixed-size binary, decimals, and other values that fit into a fixed number of bits
- Variable-length primitive types: binary, string
- Nested types: list, struct, and union
- Dictionary type: An encoded categorical type (more on this later)
Each logical data type in Arrow has a corresponding factory function for creating an instance of that type object in Python:
In [1]: import pyarrow as pa
In [2]: t1 = pa.int32()
In [3]: t2 = pa.string()
In [4]: t3 = pa.binary()
In [5]: t4 = pa.binary(10)
In [6]: t5 = pa.timestamp('ms')
In [7]: t1
Out[7]: DataType(int32)
In [8]: print(t1)