Pandas Integration

To interface with pandas, PyArrow provides various conversion routines to consume pandas structures and convert back to them.

Note

While pandas uses NumPy as a backend, it has enough peculiarities (such as a different type system, and support for null values) that this is a separate topic from NumPy Integration.

To follow examples in this document, make sure to run:

In [1]: import pandas as pd

In [2]: import pyarrow as pa

DataFrames

The equivalent of a pandas DataFrame in Arrow is a Table. Both consist of a set of named columns of equal length. While pandas supports only flat columns, a Table also supports nested columns, so it can represent data that a DataFrame cannot; a full conversion from a Table to a DataFrame is therefore not always possible.

Conversion from a Table to a DataFrame is done by calling pyarrow.Table.to_pandas(). The inverse is then achieved by using pyarrow.Table.from_pandas().

import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()

# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)

Series

In Arrow, the most similar structure to a pandas Series is an Array. It is a vector containing values of the same type, stored in contiguous memory. You can convert a pandas Series to an Arrow Array using pyarrow.Array.from_pandas(). As Arrow Arrays are always nullable, you can supply an optional mask parameter to mark null entries.

Type differences

With the current design of pandas and Arrow, it is not possible to convert all column types unmodified. One of the main issues is that pandas has no support for nullable columns of arbitrary type. In addition, datetime64 is currently fixed to nanosecond resolution. On the other hand, Arrow may still be missing support for some types.

pandas -> Arrow Conversion

Source Type (pandas)    Destination Type (Arrow)
--------------------    ------------------------
bool                    BOOL
(u)int{8,16,32,64}      (U)INT{8,16,32,64}
float32                 FLOAT
float64                 DOUBLE
str / unicode           STRING
pd.Categorical          DICTIONARY
pd.Timestamp            TIMESTAMP(unit=ns)
datetime.date           DATE

Arrow -> pandas Conversion

Source Type (Arrow)              Destination Type (pandas)
-------------------              -------------------------
BOOL                             bool
BOOL with nulls                  object (with values True, False, None)
(U)INT{8,16,32,64}               (u)int{8,16,32,64}
(U)INT{8,16,32,64} with nulls    float64
FLOAT                            float32
DOUBLE                           float64
STRING                           str
DICTIONARY                       pd.Categorical
TIMESTAMP(unit=*)                pd.Timestamp (np.datetime64[ns])
DATE                             object (with datetime.date objects)

Categorical types

TODO

Datetime (Timestamp) types

TODO

Date types

While dates can be handled using the datetime64[ns] type in pandas, some systems work with object arrays of Python’s built-in datetime.date object:

In [3]: from datetime import date

In [4]: s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)])

In [5]: s
Out[5]: 
0    2018-12-31
1          None
2    2000-01-01
dtype: object

When converting to an Arrow array, the date32 type will be used by default:

In [6]: arr = pa.array(s)

In [7]: arr.type
Out[7]: DataType(date32[day])

In [8]: arr[0]
Out[8]: datetime.date(2018, 12, 31)

To use the 64-bit date64, specify this explicitly:

In [9]: arr = pa.array(s, type='date64')

In [10]: arr.type
Out[10]: DataType(date64[ms])

When converting back with to_pandas, object arrays of datetime.date objects are returned:

In [11]: arr.to_pandas()
Out[11]: 
array([datetime.date(2018, 12, 31), None, datetime.date(2000, 1, 1)],
      dtype=object)

If you want to use NumPy’s datetime64 dtype instead, pass date_as_object=False:

In [12]: s2 = pd.Series(arr.to_pandas(date_as_object=False))

In [13]: s2.dtype
Out[13]: dtype('<M8[ns]')

Time types

TODO