Python Development¶
This page provides general Python development guidelines and source build instructions for all platforms.
Coding Style¶
We follow a PEP8-like coding style similar to that of the pandas project.
The code must pass flake8
(available from pip or conda) or it will fail the
build. Check for style errors before submitting your pull request with:
flake8 .
flake8 --config=.flake8.cython .
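As a convenience, the two checks above could be wrapped in a small pre-submit helper. This is only a sketch: the function name and the injectable `runner` parameter are illustrative assumptions and not part of the Arrow tooling.

```python
import subprocess

# The two flake8 invocations listed above, as argument lists.
STYLE_COMMANDS = [
    ["flake8", "."],
    ["flake8", "--config=.flake8.cython", "."],
]


def run_style_checks(commands=STYLE_COMMANDS, runner=subprocess.call):
    """Run each command and return those that exited with a non-zero status.

    `runner` is a hypothetical hook: it takes an argument list and returns
    an exit code, so the function can be exercised without flake8 installed.
    """
    failed = []
    for cmd in commands:
        if runner(cmd) != 0:
            failed.append(cmd)
    return failed
```

Calling `run_style_checks()` from the directory where you would run flake8 returns an empty list when both checks pass.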
Unit Testing¶
We are using pytest to develop our unit test suite. After building the project (see below) you can run its unit tests like so:
pytest pyarrow
Package requirements to run the unit tests are listed in requirements-test.txt and can be installed if needed with pip install -r requirements-test.txt.
The project has a number of custom command line options for its test suite. Some tests are disabled by default, for example. To see all the options, run
pytest pyarrow --help
and look for the “custom options” section.
Test Groups¶
We have many tests that are grouped together using pytest marks. Some of these are disabled by default. To enable a test group, pass --$GROUP_NAME, e.g. --parquet. To disable a test group, prepend disable-, for example --disable-parquet. To run only the unit tests for a particular group, prepend only- instead, for example --only-parquet.
The test groups currently include:
gandiva: tests for Gandiva expression compiler (uses LLVM)
hdfs: tests that use libhdfs or libhdfs3 to access the Hadoop filesystem
hypothesis: tests that use the hypothesis module for generating random test cases. Note that --hypothesis doesn’t work due to a quirk with pytest, so you have to pass --enable-hypothesis
large_memory: tests requiring a large amount of system RAM
orc: Apache ORC tests
parquet: Apache Parquet tests
plasma: Plasma Object Store tests
s3: tests for Amazon S3
tensorflow: tests that involve TensorFlow
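The flag conventions described above can be sketched as a small resolver. This is a simplified model, not the project's actual pytest configuration: the function names, the default values in the usage note, and the rule that --only-GROUP skips tests without that group are assumptions made for illustration.

```python
def resolve_groups(defaults, args):
    """Apply --GROUP, --enable-GROUP, --disable-GROUP and --only-GROUP flags.

    defaults: dict mapping group name -> enabled by default (bool)
    args: list of command line flags
    Returns (enabled, only): the updated dict and the set of --only- groups.
    """
    enabled = dict(defaults)
    only = set()
    for arg in args:
        if not arg.startswith("--"):
            continue
        name = arg[2:]
        if name.startswith("disable-") and name[8:] in enabled:
            enabled[name[8:]] = False
        elif name.startswith("enable-") and name[7:] in enabled:
            enabled[name[7:]] = True
        elif name.startswith("only-") and name[5:] in enabled:
            only.add(name[5:])
        elif name in enabled:
            enabled[name] = True
    return enabled, only


def should_run(test_groups, enabled, only):
    """Decide whether a test carrying the given group marks should run."""
    if only:
        # Assumption: with --only-GROUP, run just the tests in that group.
        return any(group in only for group in test_groups)
    # Otherwise a test runs when every group it belongs to is enabled.
    return all(enabled.get(group, True) for group in test_groups)
```

For example, with hypothetical defaults `{"parquet": True, "hdfs": False}`, passing `["--hdfs", "--disable-parquet"]` enables the hdfs group and disables the parquet group.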
Benchmarking¶
For running the benchmarks, see Benchmarks.
Building on Linux and macOS¶
System Requirements¶
On macOS, any modern Xcode (6.4 or higher; the current version is 8.3.1) is sufficient.
On Linux, this guide requires gcc 4.8 or higher, or clang 3.7 or higher. You can check your gcc version by running:
$ gcc --version
If the system compiler is older than gcc 4.8, it can be set to a newer version using the $CC and $CXX environment variables:
export CC=gcc-4.8
export CXX=g++-4.8
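The minimum compiler versions stated above can also be checked programmatically. The sketch below parses the text printed by `gcc --version` or `clang --version`. The helper names are illustrative, and taking the first dotted number in the output is a heuristic that may not hold for every distribution's version banner.

```python
import re

# Minimum versions stated in this guide.
MIN_VERSIONS = {"gcc": (4, 8), "clang": (3, 7)}


def parse_version(version_output):
    """Extract the first dotted version number from `--version` output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("no version number found")
    return tuple(int(part) for part in match.groups() if part is not None)


def is_supported(compiler, version_output):
    """Compare (major, minor) numerically against the guide's minimum."""
    return parse_version(version_output)[:2] >= MIN_VERSIONS[compiler]
```

Comparing integer tuples rather than strings matters here: as strings, "10.2" would sort before "4.8".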
Environment Setup and Build¶
First, let’s clone the Arrow git repository:
mkdir repos
cd repos
git clone https://github.com/apache/arrow.git
You should now see
$ ls -l
total 8
drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 arrow/
Using Conda¶
Let’s create a conda environment with all the C++ build and Python dependencies from conda-forge, targeting development for Python 3.7:
On Linux and macOS:
conda create -y -n pyarrow-dev -c conda-forge \
--file arrow/ci/conda_env_unix.yml \
--file arrow/ci/conda_env_cpp.yml \
--file arrow/ci/conda_env_python.yml \
compilers \
python=3.7
As of January 2019, the compilers package is needed on many Linux distributions to use packages from conda-forge.
With this out of the way, you can now activate the conda environment:
conda activate pyarrow-dev
For Windows, see the Building on Windows section below.
We need to set some environment variables to let Arrow’s build system know about our build toolchain:
export ARROW_HOME=$CONDA_PREFIX
Using pip¶
Warning
If you installed Python using the Anaconda distribution or Miniconda, you cannot currently use virtualenv to manage your development environment. Please follow the conda-based development instructions instead.
On macOS, use Homebrew to install all dependencies required for building Arrow C++:
brew update && brew bundle --file=arrow/python/Brewfile
On Debian/Ubuntu, you need the following minimal set of dependencies. All other dependencies will be automatically built by Arrow’s third-party toolchain.
$ sudo apt-get install libjemalloc-dev libboost-dev \
libboost-filesystem-dev \
libboost-system-dev \
libboost-regex-dev \
python-dev \
autoconf \
flex \
bison
If you are building Arrow for Python 3, install python3-dev instead of python-dev.
On Arch Linux, you can get these dependencies via pacman.
$ sudo pacman -S jemalloc boost
Now, let’s create a Python virtualenv with all Python dependencies, in the same folder as the repository, along with a target installation folder:
virtualenv pyarrow
source ./pyarrow/bin/activate
pip install six numpy pandas cython pytest
# This is the folder where we will install the Arrow libraries during
# development
mkdir dist
If your cmake version is too old on Linux, you could get a newer one via pip install cmake.
We need to set some environment variables to let Arrow’s build system know about our build toolchain:
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
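When driving the build from a script rather than an interactive shell, the same two variables can be computed in Python and passed to subprocesses. This is a minimal sketch: the function name is an assumption, and POSIX path conventions are used because this section covers Linux and macOS.

```python
import posixpath


def arrow_dev_env(base_dir, environ):
    """Return a copy of `environ` with ARROW_HOME and LD_LIBRARY_PATH set for
    a dist/ folder under `base_dir`, mirroring the two export lines above."""
    env = dict(environ)
    dist = posixpath.join(base_dir, "dist")
    env["ARROW_HOME"] = dist
    lib_dir = posixpath.join(dist, "lib")
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend dist/lib; skip the ":" separator when the variable was unset.
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")
    return env
```

The result can be passed as the `env` argument of `subprocess.run`, for example `arrow_dev_env(os.getcwd(), os.environ)`.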
Build and test¶
Now build and install the Arrow C++ libraries:
mkdir arrow/cpp/build
pushd arrow/cpp/build
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_FLIGHT=ON \
-DARROW_GANDIVA=ON \
-DARROW_ORC=ON \
-DARROW_PARQUET=ON \
-DARROW_PYTHON=ON \
-DARROW_PLASMA=ON \
-DARROW_BUILD_TESTS=ON \
..
make -j4
make install
popd
Many of these components are optional, and can be switched off by setting them to OFF:
ARROW_FLIGHT: RPC framework
ARROW_GANDIVA: LLVM-based expression compiler
ARROW_ORC: Support for Apache ORC file format
ARROW_PARQUET: Support for Apache Parquet file format
ARROW_PLASMA: Shared memory object store
If multiple versions of Python are installed in your environment, you may have to pass additional parameters to cmake so that it can find the right executable, headers and libraries. For example, specifying -DPYTHON_EXECUTABLE=$VIRTUAL_ENV/bin/python (assuming that you’re in a virtualenv) lets cmake choose the Python executable that you are using.
Note
On Linux systems with support for building on multiple architectures, make may install libraries in the lib64 directory by default. For this reason we recommend passing -DCMAKE_INSTALL_LIBDIR=lib because the Python build scripts assume the library directory is lib.
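If the Python build later fails to find the Arrow libraries, it can help to check which directory the install step actually used. The helper below is a hypothetical sketch: it assumes the installed library file names begin with `libarrow`, and it is not part of the Arrow build scripts.

```python
import os


def find_arrow_libdir(arrow_home):
    """Return "lib" or "lib64" depending on where libarrow files are found
    under `arrow_home`, or None if neither directory contains them."""
    for name in ("lib", "lib64"):
        path = os.path.join(arrow_home, name)
        if os.path.isdir(path) and any(
            entry.startswith("libarrow") for entry in os.listdir(path)
        ):
            return name
    return None
```

A result of `"lib64"` for `find_arrow_libdir(os.environ["ARROW_HOME"])` suggests the C++ build was configured without -DCMAKE_INSTALL_LIBDIR=lib.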
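Now, build pyarrow. The build_ext commands in this section read the build type from the ARROW_BUILD_TYPE environment variable, so make sure it is set (for example, to release) before running them: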
pushd arrow/python
export PYARROW_WITH_FLIGHT=1
export PYARROW_WITH_GANDIVA=1
export PYARROW_WITH_ORC=1
export PYARROW_WITH_PARQUET=1
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --inplace
popd
If you did not build one of the optional components, set the corresponding PYARROW_WITH_$COMPONENT environment variable to 0.
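Because the cmake options and the PYARROW_WITH_ variables have to agree, one way to keep them in sync is to derive both from a single list of enabled components. The sketch below is illustrative: the function name is an assumption, and PYARROW_WITH_PLASMA is inferred from the PYARROW_WITH_$COMPONENT naming pattern rather than taken from the export lines above.

```python
# Optional components named in this guide.
OPTIONAL_COMPONENTS = ["FLIGHT", "GANDIVA", "ORC", "PARQUET", "PLASMA"]


def build_settings(enabled):
    """Given the enabled component names, return (cmake_flags, pyarrow_env)."""
    unknown = set(enabled) - set(OPTIONAL_COMPONENTS)
    if unknown:
        raise ValueError("unknown components: %s" % sorted(unknown))
    cmake_flags = []
    pyarrow_env = {}
    for name in OPTIONAL_COMPONENTS:
        on = name in enabled
        # One cmake option and one matching environment variable per component.
        cmake_flags.append("-DARROW_%s=%s" % (name, "ON" if on else "OFF"))
        pyarrow_env["PYARROW_WITH_%s" % name] = "1" if on else "0"
    return cmake_flags, pyarrow_env
```

For instance, `build_settings({"PARQUET"})` turns on only the Parquet option and sets every other PYARROW_WITH_ variable to 0.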
You should be able to run the unit tests with:
$ py.test pyarrow
================================ test session starts ====================
platform linux -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /home/wesm/arrow-clone/python, inifile:
collected 1061 items / 1 skipped
[... test output not shown here ...]
============================== warnings summary ===============================
[... many warnings not shown here ...]
====== 1000 passed, 56 skipped, 6 xfailed, 19 warnings in 26.52 seconds =======
To build a self-contained wheel (including the Arrow and Parquet C++ libraries), one can set --bundle-arrow-cpp:
pip install wheel # if not installed
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
--bundle-arrow-cpp bdist_wheel
Building with CUDA support¶
The pyarrow.cuda module offers support for using Arrow platform components with NVIDIA’s CUDA-enabled GPU devices. To build with this support, pass -DARROW_CUDA=ON when building the C++ libraries, and set the following environment variable when building pyarrow:
export PYARROW_WITH_CUDA=1
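After the build, a quick way to confirm that an optional module such as pyarrow.cuda was actually built is to try importing it. The helper names below are illustrative assumptions. The sketch relies only on the standard library's `importlib`.

```python
import importlib


def can_import(module_name):
    """Return True if `module_name` imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Raised when the module, or an extension it depends on, is missing.
        return False
    return True


def report_optional_modules(names=("pyarrow.cuda",)):
    """Map each module name to whether it can be imported."""
    return {name: can_import(name) for name in names}
```

Run from the arrow/python directory after an in-place build, `report_optional_modules()` returns a dictionary whose value for `"pyarrow.cuda"` shows whether CUDA support is importable.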
Building on Windows¶
First, starting from a fresh clone of Apache Arrow, we bootstrap a conda environment similar to the one above, but skipping some of the Linux/macOS-only packages:
git clone https://github.com/apache/arrow.git
conda create -y -n pyarrow-dev -c conda-forge ^
--file arrow\ci\conda_env_cpp.yml ^
--file arrow\ci\conda_env_python.yml ^
python=3.7
conda activate pyarrow-dev
Now, we build and install the Arrow C++ libraries:
mkdir arrow\cpp\build
cd arrow\cpp\build
set ARROW_HOME=C:\thirdparty
cmake -G "Visual Studio 14 2015 Win64" ^
-DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
-DCMAKE_BUILD_TYPE=Release ^
-DARROW_BUILD_TESTS=on ^
-DARROW_CXXFLAGS="/WX /MP" ^
-DARROW_GANDIVA=on ^
-DARROW_PARQUET=on ^
-DARROW_PYTHON=on ..
cmake --build . --target INSTALL --config Release
cd ..\..
After that, we must put the install directory’s bin path in our %PATH%:
set PATH=%ARROW_HOME%\bin;%PATH%
Now, we can build pyarrow:
cd python
python setup.py build_ext --inplace --with-parquet
Then run the unit tests with:
py.test pyarrow -v
Running C++ unit tests for Python integration¶
Getting python-test.exe to run is a bit tricky because your %PYTHONHOME% must be configured to point to the active conda environment:
set PYTHONHOME=%CONDA_PREFIX%
Now python-test.exe or simply ctest (to run all tests) should work.