Two-dimensional Datasets
Columns
- class Column
An immutable column data structure consisting of a field (type metadata) and a chunked data array.
Public Functions
Construct a column from a vector of arrays.
The array chunks’ datatype must match the field’s datatype.
Construct a column from a chunked array.
The chunked array’s datatype must match the field’s datatype.
Construct a column from a single array.
The array’s datatype must match the field’s datatype.
Construct a column from a name and an array.
A field with the given name and the array’s datatype is automatically created.
Construct a column from a name and a chunked array.
A field with the given name and the array’s datatype is automatically created.
- const std::string &name() const
The column name.
- Return
- the column’s name in the passed metadata
- std::shared_ptr<DataType> type() const
The column type.
- Return
- the column’s type according to the metadata
- std::shared_ptr<ChunkedArray> data() const
The column data as a chunked array.
- Return
- the column’s data as a chunked logical array
- std::shared_ptr<Column> Slice(int64_t offset, int64_t length) const
Construct a zero-copy slice of the column with the indicated offset and length.
- Return
- a new object wrapped in std::shared_ptr<Column>
- Parameters
offset
: the position of the first element in the constructed slice
length
: the length of the slice. If there are not enough elements in the column, the length will be adjusted accordingly
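The adjustment of length described above can be illustrated with a minimal stand-alone sketch. It does not use the Arrow library: ToyColumn, MakeToyColumn and At are hypothetical names, and a single shared buffer stands in for the chunked array. The sketch only shows that a slice shares the underlying buffer (zero-copy) and that an over-long length is trimmed to the elements available.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column: a shared buffer plus a window onto it.
struct ToyColumn {
  std::shared_ptr<const std::vector<int>> buffer;  // shared between slices, never copied
  int64_t offset;                                  // first visible element in buffer
  int64_t length;                                  // number of visible elements

  // Return a view starting `off` elements into this one. The requested
  // length is trimmed so the view never runs past the end of this column.
  ToyColumn Slice(int64_t off, int64_t len) const {
    assert(off >= 0 && off <= length && len >= 0);
    len = std::min(len, length - off);
    return ToyColumn{buffer, offset + off, len};
  }

  // Read the i-th visible element.
  int At(int64_t i) const {
    assert(i >= 0 && i < length);
    return (*buffer)[static_cast<std::size_t>(offset + i)];
  }
};

// Wrap a vector of values in a column that views the whole buffer.
ToyColumn MakeToyColumn(std::vector<int> values) {
  std::shared_ptr<const std::vector<int>> buf =
      std::make_shared<std::vector<int>>(std::move(values));
  const int64_t n = static_cast<int64_t>(buf->size());
  return ToyColumn{buf, 0, n};
}
```

In this sketch, calling Slice(3, 10) on a five-element column yields a two-element view over the same buffer.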
Flatten this column as a vector of columns.
- Parameters
pool
: The pool for buffer allocations, if any
out
: The resulting vector of arrays
- bool Equals(const Column &other) const
Determine if two columns are equal.
Two columns can be equal only if they have equal datatypes. However, they may be equal even if they have different chunkings.
Determine if the two columns are equal.
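To make the point about chunking concrete, here is a minimal stand-alone sketch that compares two chunked sequences by logical content only. It is not Arrow code: Chunks and LogicallyEqual are hypothetical names, and both inputs are assumed to already share the same datatype (int).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunked sequence: a list of chunks, each a list of values.
using Chunks = std::vector<std::vector<int>>;

// Compare two chunked sequences element by element, ignoring where the
// chunk boundaries fall.
bool LogicallyEqual(const Chunks& a, const Chunks& b) {
  std::size_t ai = 0, aj = 0;  // chunk index and position within it, for a
  std::size_t bi = 0, bj = 0;  // same cursor for b
  while (true) {
    // Step over chunks that are exhausted or empty.
    while (ai < a.size() && aj == a[ai].size()) { ++ai; aj = 0; }
    while (bi < b.size() && bj == b[bi].size()) { ++bi; bj = 0; }
    const bool a_done = (ai == a.size());
    const bool b_done = (bi == b.size());
    // Equal only if both sequences end at the same logical position.
    if (a_done || b_done) return a_done && b_done;
    if (a[ai][aj] != b[bi][bj]) return false;
    ++aj;
    ++bj;
  }
}
```

With this sketch, the sequences {{1, 2}, {3}} and {{1}, {2, 3}} compare equal, since both hold the values 1, 2, 3 in order.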
Tables
- class Table
Logical table as sequence of chunked arrays.
Public Functions
- std::shared_ptr<Column> GetColumnByName(const std::string &name) const
Return a column by name.
- Return
- a Column, or null if no field was found
- Parameters
name
: field name
Remove column from the table, producing a new Table.
Add column to the table, producing a new Table.
Replace a column in the table, producing a new Table.
Replace schema key-value metadata with new metadata (EXPERIMENTAL)
- Since
- 0.5.0
- Return
- new Table
- Parameters
metadata
: new KeyValueMetadata
Flatten the table, producing a new Table.
Any column with a struct type will be flattened into multiple columns
- Parameters
pool
: The pool for buffer allocations, if any
out
: The returned table
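The effect of flattening on the column list can be sketched without the Arrow library. ToyField and FlattenOnce below are hypothetical names, and two details are assumptions of this sketch rather than statements from the text above: child columns are named "parent.child", and only one level of nesting is expanded.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical column descriptor: a name plus child fields.
// A non-empty `children` list marks the column as struct-typed.
struct ToyField {
  std::string name;
  std::vector<ToyField> children;
};

// Replace every struct-typed column with one column per child.
// Non-struct columns are passed through unchanged.
std::vector<ToyField> FlattenOnce(const std::vector<ToyField>& columns) {
  std::vector<ToyField> out;
  for (const ToyField& col : columns) {
    if (col.children.empty()) {
      out.push_back(col);
      continue;
    }
    for (const ToyField& child : col.children) {
      ToyField flat = child;
      flat.name = col.name + "." + child.name;  // naming scheme is an assumption
      out.push_back(flat);
    }
  }
  return out;
}
```

For a table with columns id and point, where point is a struct with children x and y, this sketch produces three columns: id, point.x and point.y.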
- int num_columns() const
Return the number of columns in the table.
- int64_t num_rows() const
Return the number of rows (equal to each column’s logical length).
Public Static Functions
Construct Table from schema and columns. If columns is zero-length, the table’s number of rows is zero.
- Parameters
schema
: The table schema (column types)
columns
: The table’s columns
num_rows
: number of rows in table, -1 (default) to infer from columns
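The num_rows convention can be sketched as a small stand-alone helper. ResolveNumRows is a hypothetical name, not an Arrow function. The sketch infers the row count from the first column when num_rows is negative, reports zero rows when there are no columns, and, as an extra assumption of this sketch, rejects columns whose lengths disagree with the resolved count.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Resolve the row count of a table from its column lengths.
// A negative num_rows means "infer from the columns".
// Returns false if any column length differs from the resolved count.
bool ResolveNumRows(const std::vector<int64_t>& column_lengths,
                    int64_t num_rows, int64_t* out) {
  assert(out != nullptr);
  if (num_rows < 0) {
    // No columns: the table has zero rows.
    num_rows = column_lengths.empty() ? 0 : column_lengths.front();
  }
  for (int64_t len : column_lengths) {
    if (len != num_rows) return false;
  }
  *out = num_rows;
  return true;
}
```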
Construct Table from schema and arrays.
- Parameters
schema
: The table schema (column types)
arrays
: The table’s columns as arrays
num_rows
: number of rows in table, -1 (default) to infer from columns
Construct table from RecordBatches, using schema supplied by the first RecordBatch.
- Return
- Status. Returns Status::Invalid if there is some problem
- Parameters
batches
: a std::vector of record batches
table
: the returned table
Construct table from RecordBatches, using supplied schema.
There may be zero record batches
- Return
- Status
- Parameters
schema
: the arrow::Schema for each batch
batches
: a std::vector of record batches
table
: the returned table
Construct table from multiple input tables.
The tables are concatenated vertically. Therefore, all tables should have the same schema. Each column in the output table is the result of concatenating the corresponding columns in all input tables.
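Vertical concatenation can be sketched as appending each input table's chunks to the matching output column. The code below is a stand-alone illustration, not Arrow code: ToyTable, MakeSingleChunkTable and ConcatenateToyTables are hypothetical names, and the schema is reduced to a list of column names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Chunk = std::vector<int>;

// Hypothetical table: column names plus, per column, a list of chunks.
struct ToyTable {
  std::vector<std::string> names;
  std::vector<std::vector<Chunk>> columns;
};

// Build a table in which every column holds exactly one chunk.
ToyTable MakeSingleChunkTable(const std::vector<std::string>& names,
                              const std::vector<Chunk>& values) {
  assert(names.size() == values.size());
  ToyTable table;
  table.names = names;
  for (const Chunk& v : values) {
    table.columns.push_back(std::vector<Chunk>(1, v));
  }
  return table;
}

// Concatenate tables vertically. Fails if the input is empty or if the
// column names (the stand-in for the schema) differ between tables.
bool ConcatenateToyTables(const std::vector<ToyTable>& tables, ToyTable* out) {
  assert(out != nullptr);
  if (tables.empty()) return false;
  ToyTable result = tables.front();
  for (std::size_t t = 1; t < tables.size(); ++t) {
    if (tables[t].names != result.names) return false;
    assert(tables[t].columns.size() == result.columns.size());
    for (std::size_t c = 0; c < result.columns.size(); ++c) {
      // Append this table's chunks for column c to the output column.
      const std::vector<Chunk>& chunks = tables[t].columns[c];
      result.columns[c].insert(result.columns[c].end(),
                               chunks.begin(), chunks.end());
    }
  }
  *out = result;
  return true;
}
```

Each output column ends up with the chunks of the first table followed by those of the second, and so on, so no values need to be rewritten in this model.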
Record Batches
- class RecordBatch
Collection of equal-length arrays matching a particular Schema.
A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.
Public Functions
- bool Equals(const RecordBatch &other) const
Determine if two record batches are exactly equal.
- Return
- true if batches are equal
- bool ApproxEquals(const RecordBatch &other) const
Determine if two record batches are approximately equal.
- virtual std::shared_ptr<Array> column(int i) const = 0
Retrieve an array from the record batch.
- Return
- an Array object
- Parameters
i
: field index; bounds are not checked
- std::shared_ptr<Array> GetColumnByName(const std::string &name) const
Retrieve an array from the record batch.
- Return
- an Array or null if no field was found
- Parameters
name
: field name
- virtual std::shared_ptr<ArrayData> column_data(int i) const = 0
Retrieve an array’s internal data from the record batch.
- Return
- an internal ArrayData object
- Parameters
i
: field index; bounds are not checked
Add column to the record batch, producing a new RecordBatch.
- Parameters
i
: field index, which will be bounds-checked
field
: field to be added
column
: column to be added
out
: record batch with column added
Add new nullable column to the record batch, producing a new RecordBatch.
For non-nullable columns, use the Field-based version of this method.
- Parameters
i
: field index, which will be bounds-checked
field_name
: name of field to be added
column
: column to be added
out
: record batch with column added
Remove column from the record batch, producing a new RecordBatch.
- Parameters
i
: field index, which will be bounds-checked
out
: record batch with column removed
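Because a record batch is not modified in place, adding or removing a column yields a new batch and leaves the input untouched. The stand-alone sketch below illustrates that pattern together with the index bounds check. It is not Arrow code: ToyBatch, AddToyColumn and RemoveToyColumn are hypothetical names, the length check is an assumption of the sketch, and the toy copies column values where a real implementation would presumably share them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record batch: parallel lists of names and equal-length columns.
struct ToyBatch {
  std::vector<std::string> names;
  std::vector<std::vector<int>> columns;
};

// Insert a column at position i, writing the result to *out.
// Returns false if i is out of range or the column length does not match.
bool AddToyColumn(const ToyBatch& batch, int i, const std::string& name,
                  const std::vector<int>& column, ToyBatch* out) {
  assert(out != nullptr);
  const int n = static_cast<int>(batch.names.size());
  if (i < 0 || i > n) return false;  // i == n appends at the end
  if (n > 0 && column.size() != batch.columns.front().size()) return false;
  ToyBatch result = batch;  // the input batch is left unchanged
  result.names.insert(result.names.begin() + i, name);
  result.columns.insert(result.columns.begin() + i, column);
  *out = result;
  return true;
}

// Remove the column at position i, writing the result to *out.
// Returns false if i is out of range.
bool RemoveToyColumn(const ToyBatch& batch, int i, ToyBatch* out) {
  assert(out != nullptr);
  const int n = static_cast<int>(batch.names.size());
  if (i < 0 || i >= n) return false;
  ToyBatch result = batch;
  result.names.erase(result.names.begin() + i);
  result.columns.erase(result.columns.begin() + i);
  *out = result;
  return true;
}
```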
- const std::string &column_name(int i) const
Name of the i-th column.
- int num_columns() const
- Return
- the number of columns in the record batch
- int64_t num_rows() const
- Return
- the number of rows (the corresponding length of each column)
- virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const
Slice each of the arrays in the record batch.
- Return
- new record batch
- Parameters
offset
: the starting offset to slice, through end of batch
- virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0
Slice each of the arrays in the record batch.
- Return
- new record batch
- Parameters
offset
: the starting offset to slice
length
: the number of elements to slice from offset
Public Static Functions
- Parameters
schema
: The record batch schema
num_rows
: length of fields in the record batch. Each array should have the same length as num_rows
columns
: the record batch fields as vector of arrays
Move-based constructor for a vector of Array instances.
Construct record batch from vector of internal data structures.
This overload accepts the input data only as an rvalue reference and is intended for internal use or advanced users.
- Since
- 0.5.0
- Parameters
schema
: the record batch schema
num_rows
: the number of semantic rows in the record batch. This should be equal to the length of each field
columns
: the data for the batch’s columns
Construct record batch by copying vector of array data.
- Since
- 0.5.0
- class RecordBatchReader
Abstract interface for reading stream of record batches.
Subclassed by arrow::flight::FlightMessageReader, arrow::ipc::RecordBatchStreamReader, arrow::TableBatchReader
Public Functions
- virtual std::shared_ptr<Schema> schema() const = 0
- Return
- the shared schema of the record batches in the stream
Read the next record batch in the stream.
Return null for batch when reaching end of stream
- Return
- Status
- Parameters
batch
: the next loaded batch, null at end of stream
Consume entire stream as a vector of record batches.
Read all batches and concatenate as arrow::Table.
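The end-of-stream convention (a null batch signals that the stream is finished) leads to a simple read loop. The stand-alone sketch below models it without the Arrow library: ToyBatch, ToyReader, VectorReader and ReadAllBatches are hypothetical names, and a bool return value stands in for Status.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical batch type; the contents do not matter for the read loop.
using ToyBatch = std::vector<int>;

// Hypothetical reader interface modelled on the description above.
class ToyReader {
 public:
  virtual ~ToyReader() = default;
  // Sets *batch to the next batch, or to nullptr at end of stream.
  // Returns false on error.
  virtual bool ReadNext(std::shared_ptr<ToyBatch>* batch) = 0;
};

// A reader that serves batches from an in-memory vector.
class VectorReader : public ToyReader {
 public:
  explicit VectorReader(std::vector<std::shared_ptr<ToyBatch>> batches)
      : batches_(std::move(batches)) {}

  bool ReadNext(std::shared_ptr<ToyBatch>* batch) override {
    assert(batch != nullptr);
    if (pos_ < batches_.size()) {
      *batch = batches_[pos_++];
    } else {
      batch->reset();  // null marks the end of the stream
    }
    return true;
  }

 private:
  std::vector<std::shared_ptr<ToyBatch>> batches_;
  std::size_t pos_ = 0;
};

// Consume the whole stream, appending every batch to *out.
bool ReadAllBatches(ToyReader* reader,
                    std::vector<std::shared_ptr<ToyBatch>>* out) {
  assert(reader != nullptr && out != nullptr);
  while (true) {
    std::shared_ptr<ToyBatch> batch;
    if (!reader->ReadNext(&batch)) return false;  // propagate the error
    if (batch == nullptr) return true;            // end of stream reached
    out->push_back(batch);
  }
}
```

Once the stream is exhausted, further calls to ReadNext in this sketch keep returning a null batch.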
- class TableBatchReader : public arrow::RecordBatchReader
Compute a stream of record batches from a (possibly chunked) Table.
The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.
Public Functions
- TableBatchReader(const Table &table)
Construct a TableBatchReader for the given table.
- std::shared_ptr<Schema> schema() const
- Return
- the shared schema of the record batches in the stream
Read the next record batch in the stream.
Return null for batch when reaching end of stream
- Return
- Status
- Parameters
batch
: the next loaded batch, null at end of stream
- void set_chunksize(int64_t chunksize)
Set the desired maximum chunk size of record batches.
The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.
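One way to see why a batch can come out smaller than the requested chunk size: if every batch must be a contiguous slice of a single chunk in every column, a batch has to stop at the nearest chunk boundary of any column. The stand-alone sketch below computes batch lengths under that assumption. PlanBatchLengths is a hypothetical name, and the sketch is an illustration of the idea rather than Arrow's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Given the chunk lengths of each column and a maximum chunk size, return
// the length of each batch such that no batch crosses a chunk boundary in
// any column. All columns must have the same total length.
std::vector<int64_t> PlanBatchLengths(
    const std::vector<std::vector<int64_t>>& columns, int64_t max_chunksize) {
  assert(max_chunksize > 0);
  std::vector<int64_t> plan;
  if (columns.empty()) return plan;

  int64_t remaining =
      std::accumulate(columns[0].begin(), columns[0].end(), int64_t{0});
  for (const std::vector<int64_t>& col : columns) {
    assert(std::accumulate(col.begin(), col.end(), int64_t{0}) == remaining);
  }

  std::vector<std::size_t> chunk(columns.size(), 0);  // current chunk per column
  std::vector<int64_t> used(columns.size(), 0);       // rows consumed in that chunk
  while (remaining > 0) {
    int64_t len = std::min(max_chunksize, remaining);
    for (std::size_t c = 0; c < columns.size(); ++c) {
      // Move past chunks that are fully consumed (or empty).
      while (used[c] == columns[c][chunk[c]]) {
        ++chunk[c];
        used[c] = 0;
      }
      // Never read past the end of this column's current chunk.
      len = std::min(len, columns[c][chunk[c]] - used[c]);
    }
    for (std::size_t c = 0; c < columns.size(); ++c) used[c] += len;
    plan.push_back(len);
    remaining -= len;
  }
  return plan;
}
```

For two columns chunked as {4, 2} and {3, 3}, a maximum chunk size of 10 gives batch lengths 3, 1, 2 in this sketch, and a maximum of 2 gives 2, 1, 1, 2.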