pyarrow.RecordBatch

class pyarrow.RecordBatch

Bases: pyarrow.lib._PandasConvertible

A batch of rows, stored as a collection of columns of equal length

Warning

Do not call this class’s constructor directly, use one of the RecordBatch.from_* functions instead.

__init__()

Initialize self. See help(type(self)) for accurate signature.

Methods

column(self, i) Select single column from record batch
equals(self, RecordBatch other) Check if contents of two record batches are equal
from_arrays(list arrays, names[, metadata]) Construct a RecordBatch from multiple pyarrow.Arrays
from_pandas(type cls, df, …[, nthreads, …]) Convert pandas.DataFrame to an Arrow RecordBatch
replace_schema_metadata(self[, metadata]) EXPERIMENTAL: Create a shallow copy of the record batch, replacing the schema's key-value metadata with the given metadata (which may be None, deleting any existing metadata)
serialize(self[, memory_pool]) Write RecordBatch to Buffer as encapsulated IPC message
slice(self[, offset, length]) Compute zero-copy slice of this RecordBatch
to_pandas(self[, categories]) Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
to_pydict(self) Convert the RecordBatch to an OrderedDict

Attributes

columns List of all columns in numerical order
num_columns Number of columns
num_rows Number of rows
schema Schema of the RecordBatch and its columns
column(self, i)

Select single column from record batch

Returns:column (pyarrow.Array)
columns

List of all columns in numerical order

Returns:list of pyarrow.Array
equals(self, RecordBatch other)

Check if contents of two record batches are equal

Returns:bool
static from_arrays(list arrays, names, metadata=None)

Construct a RecordBatch from multiple pyarrow.Arrays

Parameters:
  • arrays (list of pyarrow.Array) – column-wise data vectors
  • names (pyarrow.Schema or list of str) – schema or list of labels for the columns
  • metadata (dict, default None) – optional key-value metadata to attach to the resulting schema
Returns:

pyarrow.RecordBatch

from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None)

Convert pandas.DataFrame to an Arrow RecordBatch

Parameters:
  • df (pandas.DataFrame) – The DataFrame to convert
  • schema (pyarrow.Schema, optional) – The expected schema of the RecordBatch. This can be used to indicate the type of columns if we cannot infer it automatically.
  • preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting RecordBatch.
  • nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using the indicated number of threads
  • columns (list, optional) – List of columns to be converted. If None, use all columns.
Returns:

pyarrow.RecordBatch

num_columns

Number of columns

Returns:int
num_rows

Number of rows

Due to the definition of a RecordBatch, all columns have the same number of rows.

Returns:int
replace_schema_metadata(self, metadata=None)

EXPERIMENTAL: Create a shallow copy of the record batch, replacing the schema's key-value metadata with the given metadata (which may be None, deleting any existing metadata)

Parameters:metadata (dict, default None) –
Returns:shallow_copy (RecordBatch)
schema

Schema of the RecordBatch and its columns

Returns:pyarrow.Schema
serialize(self, memory_pool=None)

Write RecordBatch to Buffer as encapsulated IPC message

Parameters:memory_pool (MemoryPool, default None) – Uses default memory pool if not specified
Returns:serialized (Buffer)
slice(self, offset=0, length=None)

Compute zero-copy slice of this RecordBatch

Parameters:
  • offset (int, default 0) – Offset from start of record batch to slice
  • length (int, default None) – Length of slice (default is until end of batch starting from offset)
Returns:

sliced (RecordBatch)

to_pandas(self, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False)

Convert to a pandas-compatible NumPy array or DataFrame, as appropriate

Parameters:
  • strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
  • categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures
  • zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
  • integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
  • date_as_object (boolean, default True) – Cast dates to objects
  • use_threads (boolean, default True) – Whether to parallelize the conversion using multiple threads
  • deduplicate_objects (boolean, default True) – Do not create multiple copies of identical Python objects during conversion, to save memory. Conversion will be slower.
  • ignore_metadata (boolean, default False) – If True, do not use the ‘pandas’ metadata to reconstruct the DataFrame index, if present
Returns:

NumPy array or DataFrame depending on type of object

to_pydict(self)

Convert the arrow::RecordBatch to an OrderedDict

Returns:OrderedDict