pyarrow.parquet.ParquetDataset
class pyarrow.parquet.ParquetDataset(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=True)

Bases: object
Encapsulates details of reading a complete Parquet dataset, possibly consisting of multiple files and partitions in subdirectories.
Parameters:
- path_or_paths (str or List[str]) – A directory name, single file name, or list of file names
- filesystem (FileSystem, default None) – If nothing passed, paths assumed to be found in the local on-disk filesystem
- metadata (pyarrow.parquet.FileMetaData) – Use metadata obtained elsewhere to validate file schemas
- schema (pyarrow.parquet.Schema) – Use schema obtained elsewhere to validate file schemas. Alternative to metadata parameter
- split_row_groups (boolean, default False) – Divide files into pieces for each row group in the file
- validate_schema (boolean, default True) – Check that individual file schemas are all the same / compatible
- filters (List[Tuple] or List[List[Tuple]] or None (default)) – List of filters to apply, like [[('x', '=', 0), ...], ...]. This implements partition-level (hive) filtering only, i.e., it prevents the loading of some files of the dataset. Predicates are expressed in disjunctive normal form (DNF): each innermost tuple describes a single column predicate, the inner predicates are combined with a conjunction (AND) into a larger predicate, and the outermost list combines all of these predicates with a disjunction (OR). In this way, any filter expressible in boolean logic can be written (see the sketch after this parameter list). Passing a plain List[Tuple] is also supported; its predicates are evaluated as a conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.
- metadata_nthreads (int, default 1) – How many threads to allow in the thread pool that is used to read the dataset metadata. Increasing this is helpful for reading partitioned datasets.
- memory_map (boolean, default True) – If the source is a file path, use a memory map to read each file in the dataset if possible, which can improve performance in some environments
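A minimal usage sketch of DNF filtering on a hive-partitioned dataset. The directory layout and the partition columns (year, month) are illustrative assumptions, not part of the API:

import pyarrow.parquet as pq

# Assumed hive-partitioned layout (hypothetical paths):
#   data/year=2019/month=1/part-0.parquet
#   data/year=2019/month=2/part-0.parquet
dataset = pq.ParquetDataset(
    'data/',
    # DNF: (year = 2019 AND month = 1) OR (year = 2019 AND month = 2)
    filters=[
        [('year', '=', 2019), ('month', '=', 1)],
        [('year', '=', 2019), ('month', '=', 2)],
    ],
)
table = dataset.read()  # only files from the matching partitions are loaded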
__init__(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=True)

Initialize self. See help(type(self)) for accurate signature.
Methods

__init__(path_or_paths[, filesystem, …]) – Initialize self.
read([columns, use_threads, use_pandas_metadata]) – Read multiple Parquet files as a single pyarrow.Table
read_pandas(**kwargs) – Read dataset including pandas metadata, if any.
validate_schemas()
read(columns=None, use_threads=True, use_pandas_metadata=False)

Read multiple Parquet files as a single pyarrow.Table.
Parameters:
- columns (List[str]) – Names of columns to read from the file
- use_threads (boolean, default True) – Perform multi-threaded column reads
- use_pandas_metadata (bool, default False) – Passed through to each dataset piece
Returns: pyarrow.Table – Content of the file as a table (of columns)
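A short usage sketch of read, selecting a subset of columns; the path and the column names ('x', 'y') are placeholders:

import pyarrow.parquet as pq

dataset = pq.ParquetDataset('data/')  # 'data/' is a placeholder path
# Read only the named columns from every file in the dataset;
# multi-threaded column reads are on by default.
table = dataset.read(columns=['x', 'y'], use_threads=True)
df = table.to_pandas()  # convert the resulting pyarrow.Table to a pandas DataFrame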