Arrays¶
The central type in Arrow is the class arrow::Array
. An array
represents a known-length sequence of values all having the same type.
Internally, those values are represented by one or several buffers, the
number and meaning of which depend on the array’s data type, as documented
in the Arrow data layout specification.
Those buffers consist of the value data itself and an optional bitmap buffer that indicates which array entries are null values. The bitmap buffer can be entirely omitted if the array is known to have zero null values.
There are concrete subclasses of arrow::Array
for each data type,
that help you access individual values of the array.
Building an array¶
As Arrow objects are immutable, there are classes provided that help you
build these objects incrementally from third-party data. These classes
are organized in a hierarchy around the arrow::ArrayBuilder
base class,
with concrete subclasses tailored for each particular data type.
For example, to build an array of int64_t
elements, we can use the
arrow::Int64Builder
class. In the following example, we build an array
of the range 1 to 8 where the element that should hold the value 4 is nulled:
arrow::Int64Builder builder;
builder.Append(1);
builder.Append(2);
builder.Append(3);
builder.AppendNull();
builder.Append(5);
builder.Append(6);
builder.Append(7);
builder.Append(8);
std::shared_ptr<arrow::Array> array;
arrow::Status st = builder.Finish(&array);
if (!st.ok()) {
// ... do something on array building failure
}
The resulting Array (which can be casted to the concrete arrow::Int64Array
subclass if you want to access its values) then consists of two
arrow::Buffer
s.
The first buffer holds the null bitmap, which consists here of a single byte with
the bits 0|0|0|0|1|0|0|0
. As we use least-significant bit (LSB) numbering.
this indicates that the fourth entry in the array is null. The second
buffer is simply an int64_t
array containing all the above values.
As the fourth entry is null, the value at that position in the buffer is
undefined.
Here is how you could access the concrete array’s contents:
// Cast the Array to its actual type to access its data
auto int64_array = std::static_pointer_cast<arrow::Int64Array>(array);
// Get the pointer to the null bitmap.
const uint8_t* null_bitmap = int64_array->null_bitmap_data();
// Get the pointer to the actual data
const int64_t* data = int64_array->raw_values();
// Alternatively, given an array index, query its null bit and value directly
int64_t index = 2;
if (!int64_array->IsNull(index)) {
int64_t value = int64_array->Value(index);
}
Note
arrow::Int64Array
(respectively arrow::Int64Builder
) is
just a typedef
, provided for convenience, of arrow::NumericArray<Int64Type>
(respectively arrow::NumericBuilder<Int64Type>
).
Performance¶
While it is possible to build an array value-by-value as in the example above,
to attain highest performance it is recommended to use the bulk appending
methods (usually named AppendValues
) in the concrete arrow::ArrayBuilder
subclasses.
If you know the number of elements in advance, it is also recommended to
presize the working area by calling the Resize()
or Reserve()
methods.
Here is how one could rewrite the above example to take advantage of those APIs:
arrow::Int64Builder builder;
// Make place for 8 values in total
builder.Resize(8);
// Bulk append the given values (with a null in 4th place as indicated by the
// validity vector)
std::vector<bool> validity = {true, true, true, false, true, true, true, true};
std::vector<int64_t> values = {1, 2, 3, 0, 5, 6, 7, 8};
builder.AppendValues(values, validity);
std::shared_ptr<arrow::Array> array;
arrow::Status st = builder.Finish(&array);
If you still must append values one by one, some concrete builder subclasses have methods marked “Unsafe” that assume the working area has been correctly presized, and offer higher performance in exchange:
arrow::Int64Builder builder;
// Make place for 8 values in total
builder.Resize(8);
builder.UnsafeAppend(1);
builder.UnsafeAppend(2);
builder.UnsafeAppend(3);
builder.UnsafeAppendNull();
builder.UnsafeAppend(5);
builder.UnsafeAppend(6);
builder.UnsafeAppend(7);
builder.UnsafeAppend(8);
std::shared_ptr<arrow::Array> array;
arrow::Status st = builder.Finish(&array);
Size Limitations and Recommendations¶
Some array types are structurally limited to 32-bit sizes. This is the case for list arrays (which can hold up to 2^31 elements), string arrays and binary arrays (which can hold up to 2GB of binary data), at least. Some other array types can hold up to 2^63 elements in the C++ implementation, but other Arrow implementations can have a 32-bit size limitation for those array types as well.
For these reasons, it is recommended that huge data be chunked in subsets of more reasonable size.
Chunked Arrays¶
A arrow::ChunkedArray
is, like an array, a logical sequence of values;
but unlike a simple array, a chunked array does not require the entire sequence
to be physically contiguous in memory. Also, the constituents of a chunked array
need not have the same size, but they must all have the same data type.
A chunked array is constructed by agregating any number of arrays. Here we’ll build a chunked array with the same logical values as in the example above, but in two separate chunks:
std::vector<std::shared_ptr<arrow::Array>> chunks;
std::shared_ptr<arrow::Array> array;
// Build first chunk
arrow::Int64Builder builder;
builder.Append(1);
builder.Append(2);
builder.Append(3);
if (!builder.Finish(&array).ok()) {
// ... do something on array building failure
}
chunks.push_back(std::move(array));
// Build second chunk
builder.Reset();
builder.AppendNull();
builder.Append(5);
builder.Append(6);
builder.Append(7);
builder.Append(8);
if (!builder.Finish(&array).ok()) {
// ... do something on array building failure
}
chunks.push_back(std::move(array));
auto chunked_array = std::make_shared<arrow::ChunkedArray>(std::move(chunks));
assert(chunked_array->num_chunks() == 2);
// Logical length in number of values
assert(chunked_array->length() == 8);
assert(chunked_array->null_count() == 1);
Slicing¶
Like for physical memory buffers, it is possible to make zero-copy slices
of arrays and chunked arrays, to obtain an array or chunked array referring
to some logical subsequence of the data. This is done by calling the
arrow::Array::Slice()
and arrow::ChunkedArray::Slice()
methods,
respectively.