pyarrow.parquet.ParquetDataset
class pyarrow.parquet.ParquetDataset(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=True)

Bases: object
Encapsulates details of reading a complete Parquet dataset, possibly consisting of multiple files and partitions in subdirectories.
Parameters:
- path_or_paths (str or List[str]) – A directory name, single file name, or list of file names
- filesystem (FileSystem, default None) – If nothing passed, paths assumed to be found in the local on-disk filesystem
- metadata (pyarrow.parquet.FileMetaData) – Use metadata obtained elsewhere to validate file schemas
- schema (pyarrow.parquet.Schema) – Use schema obtained elsewhere to validate file schemas. Alternative to metadata parameter
- split_row_groups (boolean, default False) – Divide files into pieces for each row group in the file
- validate_schema (boolean, default True) – Check that individual file schemas are all the same / compatible
- filters (List[Tuple] or List[List[Tuple]] or None (default)) – List of filters to apply, like [[('x', '=', 0), ...], ...]. This implements partition-level (hive) filtering only, i.e., it prevents the loading of some files of the dataset. Predicates are expressed in disjunctive normal form (DNF): each innermost tuple describes a single column predicate, the inner predicates are combined with a conjunction (AND) into a larger predicate, and the outermost list combines all of these predicates with a disjunction (OR). In this way, any filter expressible in boolean logic can be written (see the sketch after this parameter list). Passing a plain List[Tuple] is also supported; its predicates are evaluated as a conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.
- metadata_nthreads (int, default 1) – How many threads to allow in the thread pool that is used to read the dataset metadata. Increasing this is helpful for reading partitioned datasets.
- memory_map (boolean, default True) – If the source is a file path, use a memory map to read each file in the dataset if possible, which can improve performance in some environments
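A minimal usage sketch of DNF filtering on a hive-partitioned dataset. The directory layout and the partition columns (year, month) are illustrative assumptions, not part of the API:

import pyarrow.parquet as pq

# Assumed hive-partitioned layout (hypothetical paths):
#   data/year=2019/month=1/part-0.parquet
#   data/year=2019/month=2/part-0.parquet
dataset = pq.ParquetDataset(
    'data/',
    # DNF: (year = 2019 AND month = 1) OR (year = 2019 AND month = 2)
    filters=[
        [('year', '=', 2019), ('month', '=', 1)],
        [('year', '=', 2019), ('month', '=', 2)],
    ],
)
table = dataset.read()  # only files from the matching partitions are loaded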
__init__(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=True)

Initialize self. See help(type(self)) for accurate signature.
Methods

__init__(path_or_paths[, filesystem, …]) – Initialize self.
read([columns, use_threads, use_pandas_metadata]) – Read multiple Parquet files as a single pyarrow.Table
read_pandas(**kwargs) – Read dataset including pandas metadata, if any.
validate_schemas()
read(columns=None, use_threads=True, use_pandas_metadata=False)

Read multiple Parquet files as a single pyarrow.Table.
Parameters:
- columns (List[str]) – Names of columns to read from the file
- use_threads (boolean, default True) – Perform multi-threaded column reads
- use_pandas_metadata (bool, default False) – Passed through to each dataset piece
Returns: pyarrow.Table – Content of the file as a table (of columns)
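A short usage sketch of read, selecting a subset of columns; the path and the column names ('x', 'y') are placeholders:

import pyarrow.parquet as pq

dataset = pq.ParquetDataset('data/')  # 'data/' is a placeholder path
# Read only the named columns from every file in the dataset;
# multi-threaded column reads are on by default.
table = dataset.read(columns=['x', 'y'], use_threads=True)
df = table.to_pandas()  # convert the resulting pyarrow.Table to a pandas DataFrame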