.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

Interprocess messaging / communication (IPC)
============================================

Encapsulated message format
---------------------------

Data components in the stream and file formats are represented as encapsulated
*messages* consisting of:

* A length prefix indicating the metadata size
* The message metadata as a `Flatbuffer`_
* Padding bytes to an 8-byte boundary
* The message body, which must be a multiple of 8 bytes

Schematically, we have:

::

    <metadata_size: int32>
    <metadata_flatbuffer: bytes>
    <padding>
    <message body>

The complete serialized message must be a multiple of 8 bytes so that messages
can be relocated between streams. Otherwise the amount of padding between the
metadata and the message body could be non-deterministic.

The ``metadata_size`` includes the size of the flatbuffer plus padding. The
``Message`` flatbuffer includes a version number, the particular message (as a
flatbuffer union), and the size of the message body:

::

    table Message {
      version: org.apache.arrow.flatbuf.MetadataVersion;
      header: MessageHeader;
      bodyLength: long;
    }

Currently, we support 4 types of messages:

* Schema
* RecordBatch
* DictionaryBatch
* Tensor

Streaming format
----------------

We provide a streaming format for record batches.
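
As a rough illustration of the encapsulated message framing described in the
previous section, the following Python sketch writes and reads such messages on
an in-memory stream. It is not the Arrow implementation: ``write_message``,
``read_message`` and ``padded_metadata_size`` are hypothetical helper names, the
metadata is a stand-in that carries only ``bodyLength`` as a little-endian
``int64`` instead of a real ``Message`` flatbuffer, and little-endian byte order
for the length prefix is an assumption. The sketch also assumes that the 8-byte
boundary is measured from the start of the message, so that the 4-byte prefix
plus the padded metadata is a multiple of 8.

.. code-block:: python

    import io
    import struct


    def padded_metadata_size(metadata_len: int) -> int:
        """Metadata size plus padding, so prefix + metadata ends on an 8-byte boundary."""
        return ((4 + metadata_len + 7) & ~7) - 4


    def write_message(out, body: bytes) -> None:
        """Write one encapsulated message (hypothetical helper)."""
        if len(body) % 8 != 0:
            raise ValueError("message body must be a multiple of 8 bytes")
        # Stand-in for the Message flatbuffer: only bodyLength, as an int64.
        metadata = struct.pack("<q", len(body))
        metadata_size = padded_metadata_size(len(metadata))
        out.write(struct.pack("<i", metadata_size))        # length prefix
        out.write(metadata.ljust(metadata_size, b"\x00"))  # metadata + padding
        out.write(body)                                    # message body


    def read_message(src):
        """Return the next message body, or None at end-of-stream."""
        prefix = src.read(4)
        if len(prefix) < 4:
            return None  # the stream was closed
        (metadata_size,) = struct.unpack("<i", prefix)
        if metadata_size == 0:
            return None  # explicit end-of-stream marker
        metadata = src.read(metadata_size)
        (body_length,) = struct.unpack_from("<q", metadata)
        return src.read(body_length)


    buf = io.BytesIO()
    write_message(buf, b"12345678")
    write_message(buf, b"")
    buf.write(struct.pack("<i", 0))  # 0 length written as an int32
    buf.seek(0)

    bodies = []
    while (body := read_message(buf)) is not None:
        bodies.append(body)

Here the 8-byte stand-in metadata is padded to 12 bytes, so each prefix plus
metadata occupies 16 bytes and every complete message stays a multiple of 8
bytes long.
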
The stream is presented as a sequence of encapsulated messages, each of which
follows the format above. The schema comes first in the stream, and it is the
same for all of the record batches that follow. If any fields in the schema are
dictionary-encoded, one or more ``DictionaryBatch`` messages will be included.
``DictionaryBatch`` and ``RecordBatch`` messages may be interleaved, but any
dictionary key should be defined in a ``DictionaryBatch`` before it is used in a
``RecordBatch``.

::

    ... ... ... ...

When a stream reader implementation is reading a stream, after each message it
may read the next 4 bytes to learn the size of the message metadata that
follows. Once the message flatbuffer has been read, the reader can then read the
message body.

The stream writer can signal end-of-stream (EOS) either by writing a 0 length
as an ``int32`` or by simply closing the stream interface.

File format
-----------

We define a "file format" supporting random access in a very similar format to
the streaming format. The file starts and ends with a magic string ``ARROW1``
(plus padding). What follows in the file is identical to the stream format. At
the end of the file, we write a *footer* containing a redundant copy of the
schema (which is a part of the streaming format) plus memory offsets and sizes
for each of the data blocks in the file. This enables random access to any
record batch in the file. See ``File.fbs`` for the precise details of the file
footer.

Schematically we have:

::