pyarrow.RecordBatch

class pyarrow.RecordBatch
Bases: pyarrow.lib._PandasConvertible

Batch of rows of columns of equal length

Warning: Do not call this class's constructor directly; use one of the RecordBatch.from_* functions instead.

__init__()
Initialize self. See help(type(self)) for accurate signature.
Methods

column(self, i)
    Select single column from record batch
equals(self, RecordBatch other)
from_arrays(list arrays, names[, metadata])
    Construct a RecordBatch from multiple pyarrow.Arrays
from_pandas(type cls, df, …[, nthreads, …])
    Convert pandas.DataFrame to an Arrow RecordBatch
replace_schema_metadata(self[, metadata])
    EXPERIMENTAL: Create shallow copy of record batch by replacing schema key-value metadata with the indicated new metadata (which may be None, which deletes any existing metadata)
serialize(self[, memory_pool])
    Write RecordBatch to Buffer as encapsulated IPC message
slice(self[, offset, length])
    Compute zero-copy slice of this RecordBatch
to_pandas(self[, categories])
    Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
to_pydict(self)
    Convert the arrow::RecordBatch to an OrderedDict

Attributes

columns
    List of all columns in numerical order
num_columns
    Number of columns
num_rows
    Number of rows
schema
    Schema of the RecordBatch and its columns
column(self, i)
Select single column from record batch

Returns: column (pyarrow.Array)
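
For illustration, a minimal sketch of selecting a column by index (the batch, column names, and values are made up):

    import pyarrow as pa

    # A small two-column batch, built only to demonstrate column selection.
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['x', 'y', 'z'])],
        ['ints', 'strs'])

    col = batch.column(1)    # the second column, returned as a pyarrow.Array
    # col holds the same values as pa.array(['x', 'y', 'z'])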
columns
List of all columns in numerical order

Returns: list of pyarrow.Array
equals(self, RecordBatch other)
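
A hedged sketch of comparing two batches (the arrays and names are arbitrary; equals is assumed here to compare both schema and values):

    import pyarrow as pa

    a = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['col'])
    b = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['col'])
    c = pa.RecordBatch.from_arrays([pa.array([1, 2, 4])], ['col'])

    a.equals(b)    # True: same schema and same values
    a.equals(c)    # False: the last value differs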
static from_arrays(list arrays, names, metadata=None)
Construct a RecordBatch from multiple pyarrow.Arrays

Parameters:
- arrays (list of pyarrow.Array) – column-wise data vectors
- names (pyarrow.Schema or list of str) – schema or list of labels for the columns
Returns: pyarrow.RecordBatch
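
A minimal construction sketch, passing a list of names so the schema is inferred from the arrays (the field names and data are illustrative only):

    import pyarrow as pa

    arrays = [
        pa.array([1, 2, 3, 4]),
        pa.array(['foo', 'bar', 'baz', None]),
        pa.array([True, None, False, True]),
    ]
    batch = pa.RecordBatch.from_arrays(arrays, ['f0', 'f1', 'f2'])

    batch.num_columns    # 3
    batch.num_rows       # 4 (every column must have the same length)
    batch.schema         # inferred: f0: int64, f1: string, f2: bool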
from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None)
Convert pandas.DataFrame to an Arrow RecordBatch

Parameters:
- df (pandas.DataFrame) –
- schema (pyarrow.Schema, optional) – The expected schema of the RecordBatch. This can be used to indicate the type of columns if we cannot infer it automatically.
- preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting RecordBatch.
- nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using indicated number of threads.
- columns (list, optional) – List of columns to be converted. If None, use all columns.
Returns: pyarrow.RecordBatch
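
A short sketch of the pandas conversion, assuming pandas is installed (the frame contents are made up):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

    # Let the schema be inferred and skip the index column.
    batch = pa.RecordBatch.from_pandas(df, preserve_index=False)

    batch.num_rows    # 3
    batch.schema      # a: int64, b: string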
num_columns
Number of columns

Returns: int
num_rows
Number of rows

Due to the definition of a RecordBatch, all columns have the same number of rows.

Returns: int
replace_schema_metadata(self, metadata=None)
EXPERIMENTAL: Create shallow copy of record batch by replacing schema key-value metadata with the indicated new metadata (which may be None, which deletes any existing metadata)

Parameters:
- metadata (dict, default None) –
Returns: shallow_copy (RecordBatch)
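
A small sketch of attaching and then removing custom key-value metadata (the key and value are arbitrary; metadata is stored as bytes, and the original batch is left unchanged):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['col'])

    tagged = batch.replace_schema_metadata({'origin': 'example'})
    tagged.schema.metadata    # {b'origin': b'example'}

    cleared = tagged.replace_schema_metadata()    # metadata=None deletes it again
    cleared.schema.metadata   # no metadata remains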
schema
Schema of the RecordBatch and its columns

Returns: pyarrow.Schema
serialize(self, memory_pool=None)
Write RecordBatch to Buffer as encapsulated IPC message

Parameters:
- memory_pool (MemoryPool, default None) – Uses default memory pool if not specified
Returns: serialized (Buffer)
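
A hedged round-trip sketch: the encapsulated message does not carry the schema, so reading it back requires supplying the schema separately (pyarrow.read_record_batch is assumed as the counterpart reader here):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['col'])

    buf = batch.serialize()    # a pyarrow.Buffer holding the IPC message
    restored = pa.read_record_batch(buf, batch.schema)
    restored.equals(batch)     # True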
slice(self, offset=0, length=None)
Compute zero-copy slice of this RecordBatch

Parameters:
- offset (int, default 0) – Offset from start of array to slice
- length (int, default None) – Length of slice (default is until end of batch starting from offset)
Returns: sliced (RecordBatch)
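
An illustrative sketch; the slice is a view over the same memory as the parent batch, not a copy (the data is made up):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([0, 1, 2, 3, 4])], ['col'])

    middle = batch.slice(1, 3)    # rows 1, 2, 3
    middle.num_rows               # 3
    middle.to_pydict()            # OrderedDict([('col', [1, 2, 3])])

    tail = batch.slice(2)         # no length: everything from offset 2 onwards
    tail.num_rows                 # 3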
to_pandas(self, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False)
Convert to a pandas-compatible NumPy array or DataFrame, as appropriate

Parameters:
- strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
- categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures
- zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
- integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
- date_as_object (boolean, default True) – Cast dates to objects
- use_threads (boolean, default True) – Whether to parallelize the conversion using multiple threads
- deduplicate_objects (boolean, default True) – Do not create multiple copies of Python objects during conversion, to save on memory use. Conversion will be slower
- ignore_metadata (boolean, default False) – If True, do not use the 'pandas' metadata to reconstruct the DataFrame index, if present
Returns: NumPy array or DataFrame depending on type of object
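
A minimal sketch of converting a batch back to pandas, assuming pandas is available (the column names and values are illustrative):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['x', 'y', 'z'])],
        ['nums', 'labels'])

    df = batch.to_pandas()    # pandas.DataFrame with columns 'nums' and 'labels'
    df_cat = batch.to_pandas(strings_to_categorical=True)    # 'labels' as Categorical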
to_pydict(self)
Convert the arrow::RecordBatch to an OrderedDict

Returns: OrderedDict
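
A short sketch; each column comes back as a plain Python list keyed by its name (the data is made up):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2]), pa.array(['a', 'b'])],
        ['x', 'y'])

    batch.to_pydict()
    # OrderedDict([('x', [1, 2]), ('y', ['a', 'b'])])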