pyarrow.RecordBatch

class pyarrow.RecordBatch

Bases: pyarrow.lib._PandasConvertible

A batch of rows, stored as a collection of columns of equal length

Warning

Do not call this class’s constructor directly, use one of the RecordBatch.from_* functions instead.

__init__()

Initialize self. See help(type(self)) for accurate signature.

Methods

column(self, i) Select single column from record batch
equals(self, RecordBatch other) Check if contents of two record batches are equal
from_arrays(list arrays, names[, metadata]) Construct a RecordBatch from multiple pyarrow.Arrays
from_pandas(type cls, df, …[, nthreads, …]) Convert pandas.DataFrame to an Arrow RecordBatch
replace_schema_metadata(self[, metadata]) EXPERIMENTAL: Create a shallow copy of the record batch, replacing the schema's key-value metadata with the given metadata (which may be None, deleting any existing metadata)
serialize(self[, memory_pool]) Write RecordBatch to Buffer as encapsulated IPC message
slice(self[, offset, length]) Compute zero-copy slice of this RecordBatch
to_pandas(self[, categories]) Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
to_pydict(self) Convert the RecordBatch to an OrderedDict

Attributes

columns List of all columns in numerical order
num_columns Number of columns
num_rows Number of rows
schema Schema of the RecordBatch and its columns
column(self, i)

Select single column from record batch

Returns:column (pyarrow.Array)
columns

List of all columns in numerical order

Returns:list of pyarrow.Array
equals(self, RecordBatch other)

Check if contents of two record batches are equal

Returns:bool
static from_arrays(list arrays, names, metadata=None)

Construct a RecordBatch from multiple pyarrow.Arrays

Parameters:
  • arrays (list of pyarrow.Array) – column-wise data vectors
  • names (pyarrow.Schema or list of str) – schema or list of labels for the columns
  • metadata (dict, default None) – optional key-value metadata to attach to the resulting schema
Returns:

pyarrow.RecordBatch

from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None)

Convert pandas.DataFrame to an Arrow RecordBatch

Parameters:
  • df (pandas.DataFrame) – The DataFrame to convert
  • schema (pyarrow.Schema, optional) – The expected schema of the RecordBatch. This can be used to indicate the type of columns if we cannot infer it automatically.
  • preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting RecordBatch.
  • nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using the indicated number of threads
  • columns (list, optional) – List of columns to be converted. If None, use all columns.
Returns:

pyarrow.RecordBatch

num_columns

Number of columns

Returns:int
num_rows

Number of rows

Due to the definition of a RecordBatch, all columns have the same number of rows.

Returns:int
replace_schema_metadata(self, metadata=None)

EXPERIMENTAL: Create a shallow copy of the record batch, replacing the schema's key-value metadata with the given metadata (which may be None, deleting any existing metadata)

Parameters:metadata (dict, default None) –
Returns:shallow_copy (RecordBatch)
schema

Schema of the RecordBatch and its columns

Returns:pyarrow.Schema
serialize(self, memory_pool=None)

Write RecordBatch to Buffer as encapsulated IPC message

Parameters:memory_pool (MemoryPool, default None) – Uses default memory pool if not specified
Returns:serialized (Buffer)
slice(self, offset=0, length=None)

Compute zero-copy slice of this RecordBatch

Parameters:
  • offset (int, default 0) – Offset from start of record batch to slice
  • length (int, default None) – Length of slice (default is until end of batch starting from offset)
Returns:

sliced (RecordBatch)

to_pandas(self, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False)

Convert to a pandas-compatible NumPy array or DataFrame, as appropriate

Parameters:
  • strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
  • categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures
  • zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
  • integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
  • date_as_object (boolean, default True) – Cast dates to objects
  • use_threads (boolean, default True) – Whether to parallelize the conversion using multiple threads
  • deduplicate_objects (boolean, default True) – Do not create multiple copies of identical Python objects during conversion, to save memory. Conversion will be slower.
  • ignore_metadata (boolean, default False) – If True, do not use the ‘pandas’ metadata to reconstruct the DataFrame index, if present
Returns:

NumPy array or DataFrame depending on type of object

to_pydict(self)

Convert the arrow::RecordBatch to an OrderedDict

Returns:OrderedDict