pyarrow.Table

class pyarrow.Table

Bases: pyarrow.lib._PandasConvertible

A collection of top-level named, equal length Arrow arrays.

Warning

Do not call this class’s constructor directly, use one of the from_* methods instead.

__init__()

Initialize self. See help(type(self)) for accurate signature.

Methods

add_column(self, int i, Column column) Add column to Table at position.
append_column(self, Column column) Append column at end of columns.
cast(self, Schema target_schema, bool safe=True) Cast table values to another schema
column(self, i) Select a column by its column name, or numeric index.
drop(self, columns) Drop one or more columns and return a new table.
equals(self, Table other) Check if contents of two tables are equal
flatten(self, MemoryPool memory_pool=None) Flatten this Table.
from_arrays(arrays[, names, schema, metadata]) Construct a Table from Arrow arrays or columns
from_batches(batches, Schema schema=None) Construct a Table from a sequence or iterator of Arrow RecordBatches
from_pandas(type cls, df, …[, nthreads, …]) Convert pandas.DataFrame to an Arrow Table.
itercolumns(self) Iterator over all columns in their numerical order
remove_column(self, int i) Create new Table with the indicated column removed
replace_schema_metadata(self[, metadata]) EXPERIMENTAL: Create shallow copy of table by replacing schema key-value metadata with the indicated new metadata (which may be None, which deletes any existing metadata).
set_column(self, int i, Column column) Replace column in Table at position.
to_batches(self[, chunksize]) Convert Table to list of (contiguous) RecordBatch objects, with optimal maximum chunk size
to_pandas(self[, categories]) Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
to_pydict(self) Convert the arrow::Table to an OrderedDict

Attributes

columns List of all columns in numerical order
num_columns Number of columns in this table
num_rows Number of rows in this table.
schema Schema of the table and its columns
shape (#rows, #columns)
add_column(self, int i, Column column)

Add column to Table at position. Returns new table

append_column(self, Column column)

Append column at end of columns. Returns new table

cast(self, Schema target_schema, bool safe=True)

Cast table values to another schema

Parameters:
  • target_schema (Schema) – Schema to cast to, the names and order of fields must match
  • safe (boolean, default True) – Check for overflows or other unsafe conversions
Returns:

casted (Table)

column(self, i)

Select a column by its column name, or numeric index.

Parameters:i (int or string) –
Returns:pyarrow.Column
columns

List of all columns in numerical order

Returns:list of pa.Column
drop(self, columns)

Drop one or more columns and return a new table.

columns: list of str

Returns pa.Table

equals(self, Table other)

Check if contents of two tables are equal

Parameters:other (pyarrow.Table) –
Returns:are_equal (boolean)
flatten(self, MemoryPool memory_pool=None)

Flatten this Table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

Parameters:memory_pool (MemoryPool, default None) – For memory allocations, if required, otherwise use default pool
Returns:result (Table)
static from_arrays(arrays, names=None, schema=None, metadata=None)

Construct a Table from Arrow arrays or columns

Parameters:
  • arrays (list of pyarrow.Array or pyarrow.Column) – Equal-length arrays that should form the table.
  • names (list of str, optional) – Names for the table columns. If Columns are passed, names will be inferred. If Arrays are passed, this argument is required.
  • schema (Schema, default None) – If not passed, will be inferred from the arrays
  • metadata (dict or Mapping, default None) – Optional metadata for the schema (if it is inferred)
Returns:

pyarrow.Table

static from_batches(batches, Schema schema=None)

Construct a Table from a sequence or iterator of Arrow RecordBatches

Parameters:
  • batches (sequence or iterator of RecordBatch) – Sequence of RecordBatch to be converted, all schemas must be equal
  • schema (Schema, default None) – If not passed, will be inferred from the first RecordBatch
Returns:

table (Table)

from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None, bool safe=True)

Convert pandas.DataFrame to an Arrow Table.

The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.

Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing an explicit schema and passing it to this function.

Parameters:
  • df (pandas.DataFrame) –
  • schema (pyarrow.Schema, optional) – The expected schema of the Arrow Table. This can be used to indicate the type of columns if we cannot infer it automatically.
  • preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting Table.
  • nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using indicated number of threads
  • columns (list, optional) – List of column to be converted. If None, use all columns.
  • safe (boolean, default True) – Check for overflows or other unsafe conversions
Returns:

pyarrow.Table

Examples

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     'int': [1, 2],
...     'str': ['a', 'b']
... })
>>> pa.Table.from_pandas(df)
<pyarrow.lib.Table object at 0x7f05d1fb1b40>
itercolumns(self)

Iterator over all columns in their numerical order

num_columns

Number of columns in this table

Returns:int
num_rows

Number of rows in this table.

Due to the definition of a table, all columns have the same number of rows.

Returns:int
remove_column(self, int i)

Create new Table with the indicated column removed

replace_schema_metadata(self, metadata=None)

EXPERIMENTAL: Create shallow copy of table by replacing schema key-value metadata with the indicated new metadata (which may be None, which deletes any existing metadata).

Parameters:metadata (dict, default None) –
Returns:shallow_copy (Table)
schema

Schema of the table and its columns

Returns:pyarrow.Schema
set_column(self, int i, Column column)

Replace column in Table at position. Returns new table

shape

Dimensions of the table: (#rows, #columns)

Returns:(int, int)
to_batches(self, chunksize=None)

Convert Table to list of (contiguous) RecordBatch objects, with optimal maximum chunk size

Parameters:chunksize (int, default None) – Maximum size for RecordBatch chunks. Individual chunks may be smaller depending on the chunk layout of individual columns
Returns:batches (list of RecordBatch)
to_pandas(self, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False)

Convert to a pandas-compatible NumPy array or DataFrame, as appropriate

Parameters:
  • strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
  • categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures
  • zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
  • integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
  • date_as_object (boolean, default True) – Cast dates to objects
  • use_threads (boolean, default True) – Whether to parallelize the conversion using multiple threads
  • deduplicate_objects (boolean, default True) – Do not create multiple copies of Python objects during conversion, to save on memory use. Conversion will be slower
  • ignore_metadata (boolean, default False) – If True, do not use the ‘pandas’ metadata to reconstruct the DataFrame index, if present
Returns:

NumPy array or DataFrame depending on type of object

to_pydict(self)

Convert the arrow::Table to an OrderedDict

Returns:OrderedDict