File System Interfaces

In this section, we discuss filesystem-like interfaces in PyArrow.

Hadoop File System (HDFS)

PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:

import pyarrow as pa

# host, port, user, and ticket_cache_path are placeholders for your
# NameNode host, NameNode port, Hadoop user name, and Kerberos ticket cache.
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
with fs.open(path, 'rb') as f:
    # Do something with f, e.g.:
    data = f.read()
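
The same open call also supports write modes; a minimal sketch, assuming the fs handle above and a hypothetical path:

with fs.open('/user/alice/output.dat', 'wb') as f:  # hypothetical path
    f.write(b'some bytes')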

By default, pyarrow.hdfs.HadoopFileSystem uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link / library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on some environment variables.

  • HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so.
  • JAVA_HOME: the location of your Java SDK installation.
  • ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.
  • CLASSPATH: must contain the Hadoop jars. You can set these using:
export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`

If CLASSPATH is not set, then it will be set automatically if the hadoop executable is in your system path, or if HADOOP_HOME is set.
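
As a concrete sketch, these variables can also be set from Python before the first connect call (all paths below are hypothetical and must be adjusted to your installation):

import os
import subprocess

# Hypothetical install locations; adjust to your system.
os.environ['HADOOP_HOME'] = '/usr/local/hadoop'
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['ARROW_LIBHDFS_DIR'] = '/usr/local/hadoop/lib/native'

# Equivalent of the shell command above: ask hdfs for the jar classpath.
classpath = subprocess.check_output(
    ['/usr/local/hadoop/bin/hdfs', 'classpath', '--glob'])
os.environ['CLASSPATH'] = classpath.decode('utf-8').strip()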

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                     driver='libhdfs3')

HDFS API

hdfs.connect([host, port, user, …])
    Connect to an HDFS cluster.
HadoopFileSystem.cat(path)
    Return contents of file as a bytes object.
HadoopFileSystem.chmod(self, path, mode)
    Change file permissions.
HadoopFileSystem.chown(self, path[, owner, …])
    Change file owner and group.
HadoopFileSystem.delete(path[, recursive])
    Delete the indicated file or directory.
HadoopFileSystem.df(self)
    Return free space on disk, like the UNIX df command.
HadoopFileSystem.disk_usage(path)
    Compute bytes used by all contents under the indicated path in the file tree.
HadoopFileSystem.download(self, path, stream)
    Download the file at an HDFS path to a file-like stream.
HadoopFileSystem.exists(self, path)
    Return True if the path is known to the cluster, False if it is not (or there is an RPC error).
HadoopFileSystem.get_capacity(self)
    Get reported total capacity of the file system.
HadoopFileSystem.get_space_used(self)
    Get space used on the file system.
HadoopFileSystem.info(self, path)
    Return detailed HDFS information for the path.
HadoopFileSystem.ls(path[, detail])
    Retrieve directory contents and metadata, if requested.
HadoopFileSystem.mkdir(path, **kwargs)
    Create a directory in HDFS.
HadoopFileSystem.open(self, path[, mode, …])
    Open an HDFS file for reading or writing.
HadoopFileSystem.rename(path, new_path)
    Rename a file, like the UNIX mv command.
HadoopFileSystem.rm(path[, recursive])
    Alias for FileSystem.delete.
HadoopFileSystem.upload(self, path, stream)
    Upload a file-like object to an HDFS path.
HdfsFile
    File handle returned by HadoopFileSystem.open.
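
To make the listing above concrete, here is a short end-to-end sketch using several of these methods together (the host, port, and paths are hypothetical):

import io
import pyarrow as pa

fs = pa.hdfs.connect('namenode.example.com', 8020, user='hdfs')  # hypothetical cluster

fs.mkdir('/tmp/pyarrow-demo')
fs.upload('/tmp/pyarrow-demo/data.bin', io.BytesIO(b'hello hdfs'))

print(fs.exists('/tmp/pyarrow-demo/data.bin'))  # True
print(fs.cat('/tmp/pyarrow-demo/data.bin'))     # b'hello hdfs'
print(fs.ls('/tmp/pyarrow-demo'))               # one entry for data.bin

buf = io.BytesIO()
fs.download('/tmp/pyarrow-demo/data.bin', buf)  # copy back into a local buffer

fs.delete('/tmp/pyarrow-demo', recursive=True)  # clean up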