pyarrow.Table

class pyarrow.Table

Bases: pyarrow.lib._PandasConvertible

A collection of top-level named, equal-length Arrow arrays.
Warning
Do not call this class’s constructor directly; use one of the from_* methods instead.

__init__()

Initialize self. See help(type(self)) for accurate signature.
Methods

add_column(self, int i, Column column) – Add column to Table at position.
append_column(self, Column column) – Append column at end of columns.
cast(self, Schema target_schema, bool safe=True) – Cast table values to another schema.
column(self, i) – Select a column by its column name, or numeric index.
drop(self, columns) – Drop one or more columns and return a new table.
equals(self, Table other) – Check if contents of two tables are equal.
flatten(self, MemoryPool memory_pool=None) – Flatten this Table.
from_arrays(arrays[, names, schema, metadata]) – Construct a Table from Arrow arrays or columns.
from_batches(batches, Schema schema=None) – Construct a Table from a sequence or iterator of Arrow RecordBatches.
from_pandas(type cls, df, …[, nthreads, …]) – Convert pandas.DataFrame to an Arrow Table.
itercolumns(self) – Iterator over all columns in their numerical order.
remove_column(self, int i) – Create new Table with the indicated column removed.
replace_schema_metadata(self[, metadata]) – EXPERIMENTAL: Create shallow copy of table by replacing schema key-value metadata.
set_column(self, int i, Column column) – Replace column in Table at position.
to_batches(self[, chunksize]) – Convert Table to list of (contiguous) RecordBatch objects, with optimal maximum chunk size.
to_pandas(self[, categories]) – Convert to a pandas-compatible NumPy array or DataFrame, as appropriate.
to_pydict(self) – Convert the Table to an OrderedDict.

Attributes

columns – List of all columns in numerical order.
num_columns – Number of columns in this table.
num_rows – Number of rows in this table.
schema – Schema of the table and its columns.
shape – (#rows, #columns)

add_column(self, int i, Column column)

Add column to Table at position. Returns new table.

append_column(self, Column column)

Append column at end of columns. Returns new table.

cast(self, Schema target_schema, bool safe=True)

Cast table values to another schema.

Parameters:
- target_schema (Schema) – Schema to cast to; the names and order of fields must match.
- safe (boolean, default True) – Check for overflows or other unsafe conversions.

Returns: casted (Table)

column(self, i)

Select a column by its column name, or numeric index.

Parameters: i (int or string)
Returns: pyarrow.Column

columns

List of all columns in numerical order.

Returns: list of pa.Column

drop(self, columns)

Drop one or more columns and return a new table.

Parameters: columns (list of str)
Returns: pa.Table

equals(self, Table other)

Check if contents of two tables are equal.

Parameters: other (pyarrow.Table)
Returns: are_equal (boolean)

flatten(self, MemoryPool memory_pool=None)

Flatten this Table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

Parameters: memory_pool (MemoryPool, default None) – For memory allocations, if required; otherwise use the default pool.
Returns: result (Table)

static from_arrays(arrays, names=None, schema=None, metadata=None)

Construct a Table from Arrow arrays or columns.

Parameters:
- arrays (list of pyarrow.Array or pyarrow.Column) – Equal-length arrays that should form the table.
- names (list of str, optional) – Names for the table columns. If Columns are passed, names will be inferred. If Arrays are passed, this argument is required.
- schema (Schema, default None) – If not passed, will be inferred from the arrays.

Returns: pyarrow.Table

static from_batches(batches, Schema schema=None)

Construct a Table from a sequence or iterator of Arrow RecordBatches.

Parameters:
- batches (sequence or iterator of RecordBatch) – Sequence of RecordBatch to be converted; all schemas must be equal.
- schema (Schema, default None) – If not passed, will be inferred from the first RecordBatch.

Returns: table (Table)

from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None, bool safe=True)

Convert pandas.DataFrame to an Arrow Table.
The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.
Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing an explicit schema and passing it to this function.
Parameters:
- df (pandas.DataFrame) –
- schema (pyarrow.Schema, optional) – The expected schema of the Arrow Table. This can be used to indicate the type of columns if we cannot infer it automatically.
- preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting Table.
- nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using the indicated number of threads.
- columns (list, optional) – List of columns to be converted. If None, use all columns.
- safe (boolean, default True) – Check for overflows or other unsafe conversions.

Returns: pyarrow.Table
Examples

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     'int': [1, 2],
...     'str': ['a', 'b']
... })
>>> pa.Table.from_pandas(df)
<pyarrow.lib.Table object at 0x7f05d1fb1b40>

itercolumns(self)

Iterator over all columns in their numerical order.

num_columns

Number of columns in this table.

Returns: int

num_rows

Number of rows in this table.
Due to the definition of a table, all columns have the same number of rows.
Returns: int

remove_column(self, int i)

Create new Table with the indicated column removed.

replace_schema_metadata(self, metadata=None)

EXPERIMENTAL: Create a shallow copy of the table by replacing the schema key-value metadata with the indicated new metadata (which may be None, in which case any existing metadata is deleted).

Parameters: metadata (dict, default None)
Returns: shallow_copy (Table)

schema

Schema of the table and its columns.

Returns: pyarrow.Schema

set_column(self, int i, Column column)

Replace column in Table at position. Returns new table.

shape

Dimensions of the table: (#rows, #columns).

Returns: (int, int)

to_batches(self, chunksize=None)

Convert Table to a list of (contiguous) RecordBatch objects, with optimal maximum chunk size.

Parameters: chunksize (int, default None) – Maximum size for RecordBatch chunks. Individual chunks may be smaller depending on the chunk layout of individual columns.
Returns: batches (list of RecordBatch)

to_pandas(self, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=True, bool use_threads=True, bool deduplicate_objects=True, bool ignore_metadata=False)

Convert to a pandas-compatible NumPy array or DataFrame, as appropriate.

Parameters:
- strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical.
- categories (list, default empty) – List of fields that should be returned as pandas.Categorical. Only applies to table-like data structures
- zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
- integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
- date_as_object (boolean, default True) – Cast dates to objects.
- use_threads (boolean, default True) – Whether to parallelize the conversion using multiple threads
- deduplicate_objects (boolean, default True) – Do not create multiple copies of identical Python objects during conversion, to save on memory use. Conversion will be slower.
- ignore_metadata (boolean, default False) – If True, do not use the ‘pandas’ metadata to reconstruct the DataFrame index, if present
Returns: NumPy array or DataFrame depending on type of object

to_pydict(self)

Convert the Table to an OrderedDict.

Returns: OrderedDict