pyarrow.RecordBatch¶
-
class pyarrow.RecordBatch¶
Bases: pyarrow.lib._PandasConvertible
Batch of rows of columns of equal length.
Warning
Do not call this class’s constructor directly; use one of the
RecordBatch.from_* functions instead.
-
__init__()¶ Initialize self. See help(type(self)) for accurate signature.
Methods
column(self, i) – Select single column from record batch
equals(self, RecordBatch other) – Check if contents of two record batches are equal
from_arrays(list arrays, names[, metadata]) – Construct a RecordBatch from multiple pyarrow.Arrays
from_pandas(type cls, df, …[, nthreads, …]) – Convert pandas.DataFrame to an Arrow RecordBatch
replace_schema_metadata(self[, metadata]) – EXPERIMENTAL: Create shallow copy of record batch by replacing schema key-value metadata with the indicated new metadata (which may be None, which deletes any existing metadata)
serialize(self[, memory_pool]) – Write RecordBatch to Buffer as encapsulated IPC message
slice(self[, offset, length]) – Compute zero-copy slice of this RecordBatch
to_pandas(self[, categories]) – Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
to_pydict(self) – Convert the arrow::RecordBatch to an OrderedDict
Attributes
columns – List of all columns in numerical order
num_columns – Number of columns
num_rows – Number of rows
schema – Schema of the RecordBatch and its columns
-
column(self, i)¶ Select single column from record batch
Returns: column (pyarrow.Array)
-
columns¶ List of all columns in numerical order
Returns: list of pa.Column
-
equals(self, RecordBatch other)¶ Check if contents of two record batches are equal
-
static
from_arrays(list arrays, names, metadata=None)¶ Construct a RecordBatch from multiple pyarrow.Arrays
Parameters: - arrays (list of pyarrow.Array) – column-wise data vectors
- names (pyarrow.Schema or list of str) – schema or list of labels for the columns
- metadata (dict, default None) – optional metadata for the schema (if inferred from the names list)
Returns: pyarrow.RecordBatch
-
from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None)¶ Convert pandas.DataFrame to an Arrow RecordBatch
Parameters: - df (pandas.DataFrame) –
- schema (pyarrow.Schema, optional) – The expected schema of the RecordBatch. This can be used to indicate the type of columns if we cannot infer it automatically.
- preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting RecordBatch.
- nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using the indicated number of threads
- columns (list, optional) – List of columns to be converted. If None, use all columns.
Returns: pyarrow.RecordBatch
-
num_columns¶ Number of columns
Returns: int
-
num_rows¶ Number of rows
Due to the definition of a RecordBatch, all columns have the same number of rows.
Returns: int
-
replace_schema_metadata(self, metadata=None)¶ EXPERIMENTAL: Create shallow copy of record batch by replacing schema key-value metadata with the indicated new metadata (which may be None, in which case any existing metadata is deleted)
Parameters: metadata (dict, default None) –
Returns: shallow_copy (RecordBatch)
-
schema¶ Schema of the RecordBatch and its columns
Returns: pyarrow.Schema
-
serialize(self, memory_pool=None)¶ Write RecordBatch to Buffer as encapsulated IPC message
Parameters: memory_pool (MemoryPool, default None) – Uses default memory pool if not specified
Returns: serialized (Buffer)
-
slice(self, offset=0, length=None)¶ Compute zero-copy slice of this RecordBatch
Parameters: - offset (int, default 0) – Offset from start of array to slice
- length (int, default None) – Length of slice (default is until end of batch starting from offset)
Returns: sliced (RecordBatch)
-
to_pandas(self, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False)¶ Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
Parameters: - strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
- categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures
- zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
- integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
- date_as_object (boolean, default True) – Cast dates to objects
- use_threads (boolean, default True) – Whether to parallelize the conversion using multiple threads
- deduplicate_objects (boolean, default True) – Do not create multiple copies of identical Python objects during conversion, to save on memory use. Conversion will be slower
- ignore_metadata (boolean, default False) – If True, do not use the ‘pandas’ metadata to reconstruct the DataFrame index, if present
Returns: NumPy array or DataFrame depending on type of object
-
to_pydict(self)¶ Convert the arrow::RecordBatch to an OrderedDict
Returns: OrderedDict
-