Streaming, Serialization, and IPC
Writing and Reading Streams
Arrow defines two types of binary formats for serializing record batches:
- Streaming format: for sending an arbitrary-length sequence of record batches. The format must be processed from start to end and does not support random access.
- File or Random Access format: for serializing a fixed number of record batches. It supports random access, which makes it very useful with memory maps.
To follow this section, make sure to first read the section on Memory and IO.
Using streams
First, let’s create a small record batch:
In [1]: import pyarrow as pa
In [2]: data = [
   ...:     pa.array([1, 2, 3, 4]),
   ...:     pa.array(['foo', 'bar', 'baz', None]),
   ...:     pa.array([True, None, False, True])
   ...: ]
   ...:
In [3]: batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
In [4]: batch.num_rows
Out[4]: 4
In [5]: batch.num_columns