Reading and Writing the Apache Parquet Format¶
The Apache Parquet project provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop, with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high-performance data IO.
Apache Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files. We have been concurrently developing the C++ implementation of Apache Parquet, which includes a native, multithreaded C++ adapter to and from in-memory Arrow data. PyArrow includes Python bindings to this code, which also enables reading and writing Parquet files with pandas.
Obtaining pyarrow with Parquet Support¶
If you installed pyarrow with pip or conda, it should be built with Parquet support bundled:
In [1]: import pyarrow.parquet as pq
If you are building pyarrow from source, you must use -DARROW_PARQUET=ON when compiling the C++ libraries and enable the Parquet extensions when building pyarrow. See the Python Development page for more details.
Reading and Writing Single Files¶
The read_table() and write_table() functions read and write the pyarrow.Table object, respectively.
Let’s look at a simple table:
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: import pyarrow as pa
In [5]: df = pd.DataFrame({'one': [-1, np.nan, 2.5],
...: 'two': ['foo', 'bar', 'baz'],
...: 'three': [True, False, True]},
...: index=list('abc'))
...:
In [6]: table = pa.Table.from_pandas(df)
We write this to Parquet format with write_table:
In [7]: import pyarrow.parquet as pq
In [8]: pq.write_table(table, 'example.parquet')
This creates a single Parquet file. In practice, a Parquet dataset may consist of many files in many directories. We can read a single file back with read_table:
In [9]: table2 = pq.read_table('example.parquet')
In [10]: table2.to_pandas()
Out[10]:
one two three
a -1.0 foo True
b NaN bar False
c 2.5 baz True
You can pass a subset of columns to read, which can be much faster than reading the whole file (due to the columnar layout):
In [11]: pq.read_table('example.parquet', columns=['one', 'three'])
Out[11]:
pyarrow.Table
one: double
three: bool
metadata
--------
OrderedDict([(b'pandas',
b'{"index_columns": [{"kind": "serialized", "field_name": "__i'
b'ndex_level_0__"}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object'
b'", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name":'
b' "one", "field_name": "one", "pandas_type": "float64", "nump'
b'y_type": "float64", "metadata": null}, {"name": "two", "fiel'
b'd_name": "two", "pandas_type": "unicode", "numpy_type": "obj'
b'ect", "metadata": null}, {"name": "three", "field_name": "th'
b'ree", "pandas_type": "bool", "numpy_type": "bool", "metadata'
b'": null}, {"name": null, "field_name": "__index_level_0__", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata"'
b': null}], "creator": {"library": "pyarrow", "version": "0.12'
b'.1.dev425+g828b4377f.d20190316"}, "pandas_version": "0.23.4"'
b'}')])
When reading a subset of columns from a file that used a pandas DataFrame as the source, we use read_pandas to maintain any additional index column data:
In [12]: pq.read_pandas('example.parquet', columns=['two']).to_pandas()
Out[12]:
two
a foo
b bar
c baz
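By contrast, read_table with a column subset does not read the serialized index column, so the restored DataFrame falls back to a default integer index. A small illustrative sketch, using the same example.parquet file written above:
import pyarrow.parquet as pq

# Reading only 'two' with read_table skips the serialized index column,
# so to_pandas() yields a default RangeIndex rather than the a/b/c labels.
pq.read_table('example.parquet', columns=['two']).to_pandas()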
We need not use a string to specify the origin of the file. It can be any of:
- A file path as a string
- A NativeFile from PyArrow
- A Python file object
In general, a Python file object will have the worst read performance, while a string file path or an instance of NativeFile (especially memory maps) will perform the best.
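As a rough sketch (assuming the example.parquet file written above), each of these source types can be passed directly to read_table:
import pyarrow as pa
import pyarrow.parquet as pq

# 1. A file path as a string
table = pq.read_table('example.parquet')

# 2. A NativeFile from PyArrow; a memory map avoids an extra copy into memory
with pa.memory_map('example.parquet', 'r') as source:
    table = pq.read_table(source)

# 3. A Python file object (generally the slowest option)
with open('example.parquet', 'rb') as f:
    table = pq.read_table(f)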
Omitting the DataFrame index¶
When using pa.Table.from_pandas to convert to an Arrow table, by default one or more special columns are added to keep track of the index (row labels). Storing the index takes extra space, so if your index is not valuable, you may choose to omit it by passing preserve_index=False.
In [13]: df = pd.DataFrame({'one': [-1, np.nan, 2.5],
....: 'two': ['foo', 'bar', 'baz'],
....: 'three': [True, False, True]},
....: index=list('abc'))
....:
In [14]: df
Out[14]:
one two three
a -1.0 foo True
b NaN bar False
c 2.5 baz True
In [15]: table = pa.Table.from_pandas(df, preserve_index=False)
Then we have:
In [16]: pq.write_table(table, 'example_noindex.parquet')
In [17]: t = pq.read_table('example_noindex.parquet')
In [18]: t.to_pandas()
Out[18]:
one two three
0 -1.0 foo True
1 NaN bar False
2 2.5 baz True
Here you see the index did not survive the round trip.
Finer-grained Reading and Writing¶
read_table
uses the ParquetFile
class, which has other features:
In [19]: parquet_file = pq.ParquetFile('example.parquet')
In [20]: parquet_file.metadata
Out[20]:
<pyarrow._parquet.FileMetaData object at 0x7fde3fc60908>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 4
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 1167
In [21]: parquet_file.schema