Pandas Integration¶
To interface with pandas, PyArrow provides various conversion routines to consume pandas structures and convert back to them.
Note
While pandas uses NumPy as a backend, it has enough peculiarities (such as a different type system, and support for null values) that this is a separate topic from NumPy Integration.
To follow examples in this document, make sure to run:
In [1]: import pandas as pd
In [2]: import pyarrow as pa
DataFrames¶
The equivalent to a pandas DataFrame in Arrow is a Table. Both consist of a set of named columns of equal length. While pandas only supports flat columns, the Table also provides nested columns, so it can represent more data than a DataFrame. Because of this, a full conversion is not always possible.
Conversion from a Table to a DataFrame is done by calling pyarrow.Table.to_pandas(). The inverse is achieved by using pyarrow.Table.from_pandas().
import pyarrow as pa
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()
# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)
Series¶
In Arrow, the most similar structure to a pandas Series is an Array. It is a vector that contains data of the same type in contiguous (linear) memory. You can convert a pandas Series to an Arrow Array using pyarrow.Array.from_pandas(). As Arrow Arrays are always nullable, you can supply an optional mask using the mask parameter to mark all null entries.
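For example, a boolean mask can mark entries that should become null in the resulting Array. A minimal sketch (the values and mask here are illustrative):

import numpy as np
import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2, 3, 4])
# True in the mask marks an entry as null in the resulting Arrow Array
mask = np.array([False, True, False, False])
arr = pa.Array.from_pandas(s, mask=mask)
# arr is an int64 Array with a null in the second position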
Type differences¶
With the current design of pandas and Arrow, it is not possible to convert all column types unmodified. One of the main issues is that pandas has no support for nullable columns of arbitrary type. Also, datetime64 is currently fixed to nanosecond resolution. On the other hand, Arrow may still be missing support for some types.
pandas -> Arrow Conversion¶

| Source Type (pandas) | Destination Type (Arrow) |
|---|---|
| bool | BOOL |
| (u)int{8,16,32,64} | (U)INT{8,16,32,64} |
| float32 | FLOAT |
| float64 | DOUBLE |
| str / unicode | STRING |
| pd.Categorical | DICTIONARY |
| pd.Timestamp | TIMESTAMP(unit=ns) |
| datetime.date | DATE |
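For example, inferring a schema from a small DataFrame should reflect the mapping above. A minimal sketch (exact index and metadata handling depends on the PyArrow version):

import datetime
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    "flag": [True, False],
    "count": [1, 2],
    "value": [1.5, 2.5],
    "name": ["a", "b"],
    "when": [datetime.date(2018, 12, 31), datetime.date(2019, 1, 1)],
})
# Expected field types: flag: bool, count: int64, value: double,
# name: string, when: date32[day]
print(pa.Schema.from_pandas(df))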
Arrow -> pandas Conversion¶

| Source Type (Arrow) | Destination Type (pandas) |
|---|---|
| BOOL | bool |
| BOOL with nulls | object (with values True, False, None) |
| (U)INT{8,16,32,64} | (u)int{8,16,32,64} |
| (U)INT{8,16,32,64} with nulls | float64 |
| FLOAT | float32 |
| DOUBLE | float64 |
| STRING | str |
| DICTIONARY | pd.Categorical |
| TIMESTAMP(unit=*) | pd.Timestamp (np.datetime64[ns]) |
| DATE | object (with datetime.date objects) |
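In particular, integer columns containing nulls come back as float64, since pandas has no way to store missing values in a plain integer column. A minimal sketch:

import pyarrow as pa

# An int64 column with a null becomes float64 in pandas,
# with the null represented as NaN
arr = pa.array([1, None, 3], type=pa.int64())
table = pa.Table.from_arrays([arr], names=["a"])
df = table.to_pandas()
# df["a"].dtype is float64; df["a"][1] is NaN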
Categorical types¶
TODO
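As listed in the conversion tables above, pd.Categorical columns map to Arrow's dictionary type and back. A minimal sketch (the index type of the dictionary, e.g. int8, is chosen by Arrow):

import pandas as pd
import pyarrow as pa

s = pd.Series(["a", "b", "a", "c"], dtype="category")
arr = pa.Array.from_pandas(s)
# arr.type is a dictionary type, e.g. dictionary<values=string, indices=int8, ...>
print(arr.type)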
Datetime (Timestamp) types¶
TODO
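As shown in the tables above, pd.Timestamp values convert to Arrow timestamps with nanosecond unit and convert back to datetime64[ns]. A minimal sketch:

import pandas as pd
import pyarrow as pa

s = pd.Series(pd.to_datetime(["2018-12-31 12:00", "2019-01-01 00:00"]))
arr = pa.Array.from_pandas(s)
# arr.type is timestamp[ns]
print(arr.type)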
Date types¶
While dates can be handled using the datetime64[ns] type in pandas, some systems work with object arrays of Python’s built-in datetime.date object:
In [3]: from datetime import date
In [4]: s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)])
In [5]: s
Out[5]:
0 2018-12-31
1 None
2 2000-01-01
dtype: object
When converting to an Arrow array, the date32 type will be used by default:
In [6]: arr = pa.array(s)
In [7]: arr.type
Out[7]: DataType(date32[day])
In [8]: arr[0]