pyarrow.parquet.ParquetFile
class pyarrow.parquet.ParquetFile(source, metadata=None, common_metadata=None, memory_map=True)
Bases: object
Reader interface for a single Parquet file
Parameters:
- source (str, pathlib.Path, pyarrow.NativeFile, or file-like object) – Readable source. To read a Parquet file held in bytes or a buffer-like object, wrap it in pyarrow.BufferReader.
- metadata (ParquetFileMetadata, default None) – Use an existing metadata object rather than reading it from the file.
- common_metadata (ParquetFileMetadata, default None) – Used in reads for pandas schema metadata if it is not found in the main file’s metadata; it has no other uses at the moment.
- memory_map (boolean, default True) – If the source is a file path, use a memory map to read the file, which can improve performance in some environments.
__init__(source, metadata=None, common_metadata=None, memory_map=True)
Initialize self. See help(type(self)) for accurate signature.
Methods
- __init__(source[, metadata, …]) – Initialize self.
- read([columns, use_threads, use_pandas_metadata]) – Read a Table from Parquet format.
- read_row_group(i[, columns, use_threads, …]) – Read a single row group from a Parquet file.
- scan_contents([columns, batch_size]) – Read contents of file with a single thread for indicated columns and batch size.
Attributes
- metadata
- num_row_groups
- schema
metadata
num_row_groups
read(columns=None, use_threads=True, use_pandas_metadata=False)
Read a Table from Parquet format.
Parameters:
- columns (list) – If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
- use_threads (boolean, default True) – Perform multi-threaded column reads.
- use_pandas_metadata (boolean, default False) – If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.
Returns: pyarrow.table.Table – Content of the file as a table (of columns).
read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)
Read a single row group from a Parquet file.
Parameters:
- i (int) – Index of the row group to read.
- columns (list) – If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
- use_threads (boolean, default True) – Perform multi-threaded column reads.
- use_pandas_metadata (boolean, default False) – If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.
Returns: pyarrow.table.Table – Content of the row group as a table (of columns).
scan_contents(columns=None, batch_size=65536)
Read contents of file with a single thread for indicated columns and batch size. The number of rows in the file is returned. This function is used for benchmarking.
Parameters:
- columns (list of integers, default None) – If None, scan all columns.
- batch_size (int, default 65536) – Number of rows to read at a time internally.
Returns: num_rows (number of rows in file)
schema