Internals: Data structure changes
Logical types and Physical Storage Decoupling
Since this is, in my opinion, the most important but perhaps also the most controversial change to pandas, I’m going to go over it in great detail. I think the hardest part is coming up with clear language and definitions for concepts so that we can communicate effectively. For example, the term “data type” is vague and may mean different things to different people.
A motivating example
Before digging too much into the technical details and problems/solutions, let’s look at some code examples. It is not unusual to find code like this in pandas’s internals:
def create_from_value(value, index, dtype):
    # return a new empty value suitable for the dtype
    # (excerpt from pandas internals; the is_* helpers and array types
    # are imported elsewhere in the codebase)
    if is_datetimetz(dtype):
        subarr = DatetimeIndex([value] * len(index), dtype=dtype)
    elif is_categorical_dtype(dtype):
        subarr = Categorical([value] * len(index))
    else:
        if not isinstance(dtype, (np.dtype, type(np.dtype))):
            dtype = dtype.dtype
        subarr = np.empty(len(index), dtype=dtype)
        subarr.fill(value)
    return subarr
or
if is_categorical_dtype(dtype):
    upcast_cls = 'category'
elif is_datetimetz(dtype):
    upcast_cls = 'datetimetz'
elif issubclass(dtype.type, np.bool_):
    upcast_cls = 'bool'
elif issubclass(dtype.type, np.object_):
    upcast_cls = 'object'
elif is_datetime64_dtype(dtype):
    upcast_cls = 'datetime'
elif is_timedelta64_dtype(dtype):
    upcast_cls = 'timedelta'
else:
    upcast_cls = 'float'
I’ve cherry-picked a couple of the many places where this kind of dtype-based branching happens.
The primary reason for this complexity is that pandas is using both NumPy’s dtype objects (which describe physical storage) as well as its own custom data type objects as a proxy for pandas’s semantic logical types.
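To see this conflation concretely, here is a small illustration using current pandas APIs (this shows present-day behavior, not a proposed change):

import numpy as np
import pandas as pd

s_float = pd.Series([1.0, 2.0])
s_cat = pd.Series(['a', 'b'], dtype='category')

# A float Series carries a genuine NumPy dtype object...
print(isinstance(s_float.dtype, np.dtype))  # True

# ...while a categorical Series carries a pandas-only dtype object
# that merely mimics the numpy.dtype interface.
print(isinstance(s_cat.dtype, np.dtype))    # False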
Let’s step back for a second and come up with clear language to steer the discussion.
Some definitions
Here is my attempt at definitions of some of the key terms:
- Metadata: data that describes other data (such as its in-memory layout)
- Semantics: The meaning / abstract interpretation of something. We often discuss the semantics (meaning) of computer programs (i.e. what they do, fundamentally) without touching upon low level details like machine representation, programming languages, compilers, operating systems, etc.
- Physical data (or storage) types: these are metadata objects which provide a description of the precise structure of a piece of data in memory.
  - In NumPy, the numpy.dtype object (aka PyArray_Descr in the C API) is metadata describing a single cell / value in an array. Combined with the shape and strides attributes of the ndarray object, you have enough information to perform O(1) random access on any cell in an ndarray and to assign these values to a C type (or, in the case of structured dtypes, assign to a packed C struct).
  - This may or may not include a physical representation of NULL or missing data (for example: nullable float64 might be a physical type indicating a normal float64 array along with a bitmap of null/not-null indicators).
- Logical data type: metadata which describes the semantic content of a single value in an array or other collection of values. Depending on the logical type, it may map 1-to-1 to a physical type or not at all. Here are some examples:
  - The double or float64 type may be viewed both as a logical type and as a physical type (a 1-to-1 correspondence).
  - pandas’s category dtype contains its own auxiliary array of category values (for example, the distinct strings collected from a string array). Based on the number of categories, the category codes (which reference the categories array) are stored in the smallest possible integer physical type (from int8 to int64, depending on which type can accommodate the codes). For example, with 50 distinct categories the codes fit in int8 storage; with 1000 categories, int16 is needed.
  - Another example: timestamps may be physically stored in int64 storage, and these values are interpreted in the context of a particular time unit or resolution (e.g. nanoseconds, milliseconds, seconds).
In general, new logical types may be formed either by placing new semantics on top of a single physical data type or some composition of physical or logical types. For example: you could have a categorical type (a logical construct consisting of multiple arrays of data) whose categories are some other logical type.
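As a toy sketch of such a composition (plain NumPy only; none of this is a proposed API):

import numpy as np

# Logically this is Categorical[String]; physically it is two arrays:
# small integer codes plus the array of distinct categories.
categories = np.array(['apple', 'banana', 'cherry'], dtype=object)

# Three distinct categories easily fit in int8 codes.
codes = np.array([0, 2, 1, 0, 2], dtype=np.int8)

# Reconstructing the logical values is a take() on the categories.
print(categories.take(codes))
# ['apple' 'cherry' 'banana' 'apple' 'cherry']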
For historical reasons, pandas never developed a clear or clean semantic separation in its user API between logical and physical data types. Also, the addition of new, pandas-only “synthetic” dtypes that are unknown to NumPy (like categorical, datetimetz, etc.) has expanded this conflation considerably. If you also consider pandas’s custom missing / NULL data behavior, the addition of ad hoc missing data semantics to a physical NumPy data type created, by the definitions above, a logical data type (call it object[nullable] for an object array) without ever explicitly saying so.
You might be thinking, “Good job, Wes. You really messed that up!” I’d be inclined to agree with you now in retrospect, but back in 2011 pandas was not the super popular project that it is today, and we were truly riding on NumPy’s coat tails. So the extent to which NumPy concepts and APIs were used explicitly in pandas made the library easier to adopt. Now in 2016, this feels anachronistic / outdated.
High-level logical type proposal
As we have been discussing periodically on the pandas-dev mailing list and GitHub, I am proposing that we start to unravel our current mess by defining pandas-specific metadata objects that model the current semantics / behavior of the project. What does this mean, exactly?
- Each NumPy dtype object will map 1-to-1 to an equivalent pandas.DataType object.
- Existing pandas “extension dtypes” (like CategoricalDtype and DatetimeTZDtype), which have been designed to mimic numpy.dtype, will become logical type subclasses of pandas.DataType like every other type in pandas.
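A minimal sketch of what such a hierarchy could look like (all class names other than DataType itself are hypothetical and purely illustrative):

import numpy as np

class DataType:
    """Base class for all pandas logical types (proposed)."""

class Float64(DataType):
    # A logical type with a 1-to-1 physical correspondence
    numpy_dtype = np.dtype('float64')

class CategoricalType(DataType):
    # A logical type parameterized by another logical type; the integer
    # dtype used for the codes is a physical detail chosen at runtime
    # based on the number of categories.
    def __init__(self, categories_type):
        self.categories_type = categories_type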
Since pandas is about assisting with data manipulation and analysis, at some point you must invoke functions that are specialized to the specific physical memory representation of your data. For example, pandas has its own implementations of ndarray.take that are used internally for arrays of positive integers that may contain NULL / NA values (which are represented as -1 – search the codebase for implementations of take_1d).
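The -1 sentinel convention can be sketched in a few lines (an illustration of the idea, not the actual take_1d implementation):

import numpy as np

values = np.array([10., 20., 30.])
indexer = np.array([0, -1, 2])   # -1 marks a missing value

# Take normally, then patch the positions marked NULL with NaN.
out = values.take(indexer)
out[indexer == -1] = np.nan
print(out)  # [10. nan 30.]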
The major goals of introducing a logical type abstraction are as follows:
- Simplifying “dynamic dispatch”: invoking the right functions or choosing the right code branches based on the data type (see the sketch after this list).
- Enabling pandas to decouple both its internal semantics and physical storage from NumPy’s metadata and APIs. Note that this is already happening with categorical types, since a particular instance of CategoricalDtype may physically be stored in one of 4 NumPy data types.
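For instance, dispatch could become a registry keyed by logical type class instead of chains of is_*_dtype checks (a hypothetical sketch, not a proposed pandas API):

# Map logical type classes to specialized take() implementations.
_TAKE_IMPL = {}

def register_take(logical_type_cls):
    def decorator(func):
        _TAKE_IMPL[logical_type_cls] = func
        return func
    return decorator

def take(values, indexer, logical_type):
    # One dictionary lookup replaces a chain of dtype-checking branches.
    return _TAKE_IMPL[type(logical_type)](values, indexer)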
Physical storage decoupling
By separating pandas data from the presumption of using a particular physical numpy.dtype internally, we can:
- Begin to better protect users from NumPy data semantics (which are frequently different from pandas’s!) leaking through to the pandas user API. This can enable us to address long-standing inconsistencies or “rough edges” in pandas that have persisted due to our tight semantic coupling to NumPy.
- We can consider adding new data structures to pandas, either custom to pandas or provided by 3rd-party libraries, that add new functionality alongside the existing code (presuming NumPy physical storage). As one concrete example, discussed in more detail below, we can enable missing data in integer pandas data by forming a composite data structure consisting of a NumPy array plus a bitmap marking the null / not-null values (see the sketch after this list).
  - It may end up being a requirement that 3rd-party data structures will need to have a C or C++ API to be used in pandas.
- We can start to think about improved behavior around data ownership (like copy-on-write) which may yield many benefits. I will write a dedicated section about this.
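To make the integer-plus-bitmap idea concrete, here is a toy sketch in which a boolean mask stands in for the bitmap (illustrative only):

import numpy as np

# Physical storage: a plain int64 array plus a validity mask.
values = np.array([1, 2, 0, 4], dtype=np.int64)
valid = np.array([True, True, False, True])  # False marks NULL

# A null-skipping sum never needs to cast the data to float64.
print(values[valid].sum())  # 7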
Note that none of these points implies that we are trying to use NumPy less. We already have large amounts of code that implement algorithms similar to those found in NumPy (e.g. pandas.unique or the implementation of Series.sum), but taking into account pandas’s missing data representation, etc. Internally, we can use NumPy when its computational semantics match those we’ve chosen for pandas, and elsewhere we can invoke pandas-specific code.
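The difference in computational semantics is easy to demonstrate with missing data (this is current behavior of both libraries):

import numpy as np
import pandas as pd

arr = np.array([1.0, np.nan, 3.0])

print(np.sum(arr))           # nan -- NumPy propagates NaN
print(pd.Series(arr).sum())  # 4.0 -- pandas skips missing values by default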
A major concern here based on these ideas is preserving NumPy interoperability, so I’ll examine this topic in some detail next.
Correspondence between logical and physical types
- Floating point numbers
  - Logical: Float16/32/64
  - Physical: numpy.float16/32/64, with NaN for null (for backwards compatibility)
- Signed Integers
  - Logical: Int8/16/32/64
  - Physical: numpy.int8/16/32/64 array plus a nullness bitmap
- Unsigned Integers
  - Logical: UInt8/16/32/64
  - Physical: numpy.uint8/16/32/64 array plus a nullness bitmap
- Boolean
  - Logical: Boolean
  - Physical: np.bool_ (a.k.a. np.uint8) array plus a nullness bitmap. We may also explore bit storage (versus bytes).
- Categorical
  - Logical: Categorical[T], where T is any other logical type
  - Physical: this type is a composition of an integer codes array, stored as Int8 through Int64 (depending on the cardinality of the categories), plus the categories array; each component has the same physical representation as its corresponding logical type above.
- String and Binary
  - Logical: String and Binary
  - Physical: dictionary-encoded representation for UTF-8 and general binary data, as described in the string section
- Timestamp
  - Logical: Timestamp[unit], where unit is the resolution. Nanoseconds can continue to be the default unit for now
  - Physical: numpy.int64, with INT64_MIN as the null value
- Timedelta
  - Logical: Timedelta[unit], where unit is the resolution
  - Physical: numpy.int64, with INT64_MIN as the null value
- Period
  - Logical: Period[unit], where unit is the resolution
  - Physical: numpy.int64, with INT64_MIN as the null value
- Interval
  - Logical: Interval
  - Physical: two arrays of Timestamp[U] – these may need to be forced to both be the same resolution
- Python objects (catch-all for other data types)
  - Logical: Object
  - Physical: numpy.object_ array, with None for null values (perhaps with np.NaN also for backwards compatibility)
- Complex numbers
  - Logical: Complex64/128
  - Physical: numpy.complex64/128, with NaN for null (for backwards compatibility)
Some notes on these:
- While a pandas (logical) type may map onto one or more physical representations, in general NumPy types will map directly onto a pandas type. Thus, existing code involving numpy.dtype-like objects (such as 'f8' or numpy.float64) will continue to work.
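For example, all of these spellings work today and would continue to identify the same logical type:

import numpy as np
import pandas as pd

s1 = pd.Series([1.5, 2.5], dtype='f8')
s2 = pd.Series([1.5, 2.5], dtype=np.float64)
s3 = pd.Series([1.5, 2.5], dtype='float64')

print(s1.dtype == s2.dtype == s3.dtype)  # True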
Preserving NumPy interoperability
Some of the types of intended interoperability between NumPy and pandas are as follows:
- Access to internal data: users can obtain the underlying numpy.ndarray (possibly a view, depending on the internal block structure; more on this soon) in constant time and without copying the actual data. This has a couple of other implications:
  - Changes made to this array will be reflected in the source pandas object.
  - If you write C extension code (possibly in Cython) and respect pandas’s missing data details, you can invoke certain kinds of fast custom code on pandas data (but it’s somewhat inflexible – see the latest discussion on adding a native code API to pandas).
- Ufuncs: NumPy ufuncs (like np.sqrt or np.log) can be invoked on pandas objects like Series and DataFrame.
- Array protocol: numpy.asarray will always yield some array, even if it discards metadata or has to create a new array. For example, asarray invoked on pandas.Categorical yields a reconstructed array (rather than either the categories or codes internal arrays).
- Interchangeability: many NumPy methods designed to work on subclasses (or duck-typed classes) of ndarray may be used. For example, numpy.sum may be used on a Series even though it does not invoke NumPy’s internal C sum algorithm. This means that a Series may be used as an interchangeable argument in a large set of functions that only know about NumPy arrays.
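A few of these interoperability points in action with current pandas:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# Ufuncs work directly on pandas objects and return pandas objects.
print(np.sqrt(s).tolist())  # [1.0, 2.0, 3.0]

# numpy.asarray always yields some ndarray; on a Categorical it
# reconstructs the full values rather than exposing codes/categories.
cat = pd.Categorical(['a', 'b', 'a'])
print(np.asarray(cat))  # ['a' 'b' 'a']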
By and large, I think much of this can be preserved, but there will be some API breakage. In particular, interchangeability is not something we can or should guarantee.
If we add more composite data structures (Categorical can be thought of as one existing composite data structure) to pandas or alternate non-NumPy data structures, there will be cases where the semantic information in a Series cannot be adequately represented in a NumPy array.
As one example, if we add pandas-only missing data support to integer and boolean data (a long requested feature), calling np.asarray on such data may not have well-defined behavior. At present, pandas is implicitly converting these types to float64 (see more below), which isn’t too great. A decision does not need to be made now, but the benefits of solving this long-standing issue may merit breaking asarray, as long as we provide an explicit way to obtain the cast float64 NumPy array (with NaN for NULL/NA values).
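Today’s implicit conversion is easy to observe (current behavior, shown here to make the problem concrete):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# Introducing a missing value silently upcasts the data to float64.
s2 = s.reindex([0, 1, 2, 3])
print(s2.dtype)        # float64
print(np.asarray(s2))  # [ 1.  2.  3. nan]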
For pandas data that does not step outside NumPy’s semantic realm, we can continue to provide zero-copy views in many cases.
Missing data consistency
Once the physical memory representation has been effectively decoupled from the user API, we can consider various approaches to implementing missing data in a consistent way for every logical pandas data type.
To motivate this, let’s look at some integer data:
In [1]: s = pd.Series([1, 2, 3, 4, 5])
In [2]: s
Out[2]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
In [3]: s.dtype