Data Types
Data types govern how physical data is interpreted. Their specification allows binary interoperability between different Arrow
implementations, including implementations written in different programming languages and running on different runtimes
(for example, the same data can be accessed, without copying, from
both Python and Java using the pyarrow.jvm bridge module).
Information about a data type in C++ can be represented in three ways:
- Using an arrow::DataType instance (e.g. as a function argument)
- Using an arrow::DataType concrete subclass (e.g. as a template parameter)
- Using an arrow::Type::type enum value (e.g. as the condition of a switch statement)
The first form (using an arrow::DataType instance) is the most idiomatic
and flexible. Runtime-parametric types can only be fully represented with
a DataType instance. For example, an arrow::TimestampType needs to be
constructed at runtime with an arrow::TimeUnit::type parameter; an
arrow::Decimal128Type with scale and precision parameters; and
an arrow::ListType with a full child type (itself an
arrow::DataType instance).
The two other forms can be used where performance is critical, in order to avoid paying the price of dynamic typing and polymorphism. However, some amount of runtime switching can still be required for parametric types. It is not possible to reify all possible types at compile time, since Arrow data types allow arbitrary nesting.
Creating data types
To instantiate data types, it is recommended to call the provided factory functions:
std::shared_ptr<arrow::DataType> type;
// A 16-bit integer type
type = arrow::int16();
// A 64-bit timestamp type (with microsecond granularity)
type = arrow::timestamp(arrow::TimeUnit::MICRO);
// A list type of single-precision floating-point values
type = arrow::list(arrow::float32());