Internals: Enhanced string / UTF-8 handling

There are some things we can do to make pandas use less memory and perform computations significantly faster on string data.

Current string problems

pandas offers support for columns containing strings (ASCII or Unicode) on a somewhat ad hoc basis.

  • Strings are stored in NumPy arrays of PyObject* / numpy.object_ dtype. This has several problems:
    • Computations (e.g. groupby operations) typically utilize a code path for generic Python objects. For example, comparisons and hashing go through the PyObject_* C API functions. In addition to harming multithreading due to GIL contention (you must acquire the GIL to use these functions), these can also be significantly slower than algorithms that operate on const char*, potentially taking advantage of hardware optimizations (see the sketch after this list).
    • String arrays often feature many copies of or references to the same PyString. Thus, some algorithms may perform redundant computation. Some parts of pandas, like pandas.read_csv, make an effort to deduplicate strings to free memory and accelerate computations (e.g. if you do x == y, and x and y are references to the same PyObject*, Python skips comparing their internal data).
      • Note that this is somewhat mitigated by using pandas.Categorical, but this is not the default storage mechanism. More on this below.
    • Using PyString objects and PyObject* NumPy storage adds non-trivial overhead (52 bytes in Python 3, slightly less in Python 2; see this exposition for a deeper dive) to each value.
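
To make the contrast concrete, here is a minimal sketch (plain C, not pandas code) of the kind of const char*-level algorithm the PyObject_* path rules out: hashing strings stored in a packed buffer, with no GIL and no per-object indirection. The buffer layout anticipates the proposal below; fnv1a is just an illustrative hash function, not anything pandas uses.

#include <stdint.h>
#include <stdio.h>

/* FNV-1a over a byte range: operates directly on const char* data,
   with no PyObject_Hash call, no GIL, and a tight loop the compiler
   can optimize */
static uint64_t fnv1a(const char* data, uint32_t start, uint32_t end) {
    uint64_t h = 14695981039346656037ULL;
    for (uint32_t i = start; i < end; i++) {
        h ^= (uint8_t)data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void) {
    /* "foo" and "bars" packed into one buffer, delimited by offsets */
    const char data[] = "foobars";
    const uint32_t offsets[] = {0, 3, 7};

    for (int i = 0; i < 2; i++)
        printf("hash %d: %llu\n", i,
               (unsigned long long)fnv1a(data, offsets[i], offsets[i + 1]));
    return 0;
}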

Possible solution: new non-NumPy string memory layout

My proposed solution to the string conundrum is the following:

  • Create a custom string array container type suitable for use in a pandas.Array, and a pandas.string logical data type.
  • Require that all strings be encoded as UTF-8.
  • By default, represent all string arrays internally as dictionary-encoded a.k.a. categorical. Thus, we will typically only ever have 1 copy of any given string in an array.
  • Store the actual string data in a packed UTF-8 buffer. I have seen this in a number of places, but notably it’s the way that Apache Arrow implements variable-length collections.

Here is one possible C struct-like layout of this container:

typedef struct {
  /* Dictionary (category) indices: one per value, indexing into offsets */
  uint32_t* indices;

  /* Offsets into the packed string data: dictionary string i
     occupies bytes [offsets[i], offsets[i+1]) */
  uint32_t* offsets;

  /* The packed UTF-8 data */
  const char* data;

  /* Validity bitmap: bit i is 1 when value i is non-null */
  uint8_t* bitmap;
} string_array_t;

Here’s an example of what the data would look like:

actual data : ['foo', 'bars', 'foo', null, 'bars']

indices: [0, 1, 0, 0, 1]

bitmap[0] (read right-to-left; bit i is 1 when value i is non-null):

  0 0 0 1 0 1 1 1
          ^ bit 3 is 0, so value 3 (the null) is unset

offsets: [0, 3, 7]
data: ['f', 'o', 'o', 'b', 'a', 'r', 's']
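
To read this layout back, retrieving value i is a bitmap check plus two array lookups. Here is a sketch of an accessor (string_array_get is an illustrative helper, not a proposed API):

#include <stdint.h>
#include <stddef.h>

/* Fetch value i from a string_array_t (defined above) without copying.
   Returns 0 and fills *out / *out_len for a non-null value, -1 for null. */
static int string_array_get(const string_array_t* arr, size_t i,
                            const char** out, uint32_t* out_len) {
    /* bit i of the validity bitmap is 1 when value i is non-null */
    if (!(arr->bitmap[i / 8] & (1 << (i % 8))))
        return -1;
    uint32_t dict_index = arr->indices[i];
    uint32_t start = arr->offsets[dict_index];
    *out = arr->data + start;
    *out_len = arr->offsets[dict_index + 1] - start;
    return 0;
}

For the example above, i = 1 gives indices[1] = 1, so the value is data[offsets[1]:offsets[2]] = 'bars'; i = 3 fails the bitmap check and comes back null.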

Some benefits of this approach include:

  • Much better data locality for low-cardinality categorical data
  • 8.125 bytes (8 bytes plus 1 bit) of memory overhead per value, versus the current 33 to 52 bytes.
  • The data is already categorical: a cast to category dtype can be performed very cheaply and without duplicating the underlying string memory buffer
  • Computations like groupby on dictionary-encoded strings will be as performant as they are on Categorical today (see the sketch after this list)
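
As a sketch of that last point (an illustrative helper, not pandas internals): a value_counts-style aggregation over the dictionary-encoded form touches only the uint32_t indices and the bitmap, never the string bytes.

#include <stdint.h>
#include <stddef.h>

/* Count occurrences of each dictionary entry; counts must have one
   zero-initialized slot per dictionary string. The packed string
   data is never read. */
static void count_by_dictionary(const uint32_t* indices, size_t n_values,
                                const uint8_t* bitmap, uint64_t* counts) {
    for (size_t i = 0; i < n_values; i++) {
        if (bitmap[i / 8] & (1 << (i % 8)))  /* skip nulls */
            counts[indices[i]]++;
    }
}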

Some drawbacks

  • This memory layout is best used as an immutable representation. Mutating slots becomes more complex: both single-value assignments and put / array-assignment operations would likely require constructing a new data buffer (either by realloc or some other copying mechanism). Without a compaction / “garbage collection” step on this buffer it will be possible to have “dead” memory inside it; for example, if you did arr[:] = 'a-new-string-value', all the existing values would be orphaned (see the sketch after this list).
    • Some systems have addressed this issue by storing all string data in a “global string hash table”. This is something we could explore, but it would add quite a bit of complexity to implement and may not be worthwhile at this time.
  • Indexing into this data structure to obtain a single Python object will require calling PyUnicode_FromStringAndSize to construct a string (Python 3, therefore Unicode). This requires a memory allocation, whereas the current representation only has to do a Py_INCREF.
  • Many of pandas’s existing algorithms that assume Python objects would need to be specialized to take advantage of this new memory layout. This is both a con (a large amount of implementation work) and a pro, as the specialized code paths will most likely yield significantly better performance.
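
To illustrate the mutation drawback, here is a sketch of single-value assignment against a hypothetical mutable variant of the layout (mutable_string_array_t and string_array_set are illustrative names); note that nothing reclaims the old string's bytes:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mutable variant of string_array_t with an explicit
   dictionary size, so the buffers can grow */
typedef struct {
    uint32_t* indices;
    uint32_t* offsets;   /* n_dict + 1 entries */
    char* data;
    uint8_t* bitmap;
    uint32_t n_dict;
} mutable_string_array_t;

/* arr[i] = value: append the new string to the data buffer, add a new
   dictionary slot, and repoint indices[i]. If no other slot referenced
   the old string, its bytes are now "dead" memory in the buffer.
   realloc failures are unchecked for brevity. */
static void string_array_set(mutable_string_array_t* arr, size_t i,
                             const char* value, uint32_t len) {
    uint32_t end = arr->offsets[arr->n_dict];

    arr->data = realloc(arr->data, end + len);
    memcpy(arr->data + end, value, len);

    arr->offsets = realloc(arr->offsets,
                           (arr->n_dict + 2) * sizeof(uint32_t));
    arr->offsets[arr->n_dict + 1] = end + len;

    arr->indices[i] = arr->n_dict++;
    arr->bitmap[i / 8] |= (uint8_t)(1 << (i % 8));  /* mark valid */
}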

Concerns / problems

Preserving code that assumes PyString objects

Any alternate UTF-8 string in-memory representation should necessarily be able to yield Python string objects using PyUnicode_FromStringAndSize. Thus, code like this could continue to work:

In [1]: s = pd.Series(["como estás?"])

In [2]: s.map(lambda x: x.upper())
Out[2]: 
0    COMO ESTÁS?
dtype: object

One trade-off is that creating the temporary Python strings is potentially costly. This could be mitigated for the Python str methods (which can use an optimized array-oriented code path under the hood), but for arbitrary functions you would have to pay this cost, as sketched below.
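
In C API terms, the arbitrary-function path looks roughly like the following (a sketch building on the string_array_get helper above; error handling and result storage are omitted):

#include <Python.h>

/* Apply a Python callable to every value: each element must first be
   boxed into a temporary PyObject via PyUnicode_FromStringAndSize */
static void map_pyfunc(const string_array_t* arr, size_t n_values,
                       PyObject* func) {
    for (size_t i = 0; i < n_values; i++) {
        const char* ptr;
        uint32_t len;
        if (string_array_get(arr, i, &ptr, &len) != 0)
            continue;  /* null in, null out */
        PyObject* tmp = PyUnicode_FromStringAndSize(ptr, (Py_ssize_t)len);  /* allocates */
        PyObject* result = PyObject_CallFunctionObjArgs(func, tmp, NULL);
        Py_DECREF(tmp);
        /* ... write result into the output array ... */
        Py_XDECREF(result);
    }
}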

Accommodating Non-UTF-8 data

Some pandas users will have code that involves various non-UTF-8 Python string types:

  • Native unicode: Py_UCS1, Py_UCS2, Py_UCS4
  • Non-UTF-8 PyBytes

For example:
In [3]: s = pd.Series(["como estás?"])

In [4]: s
Out[4]: 
0    como estás?
dtype: object

In [5]: s.str.encode('latin-1')
Out[5]: 
0    b'como est\xe1s?'
dtype: object

In [6]: s.str.encode('latin-1').str.decode('latin-1')
Out[6]: 
0    como estás?
dtype: object

Such data could arise from reading a CSV file in a non-UTF-8 encoding without indicating the encoding to pandas.read_csv.

My proposed solution to this is to provide a binary logical type having the same physical memory layout as UTF-8 strings, with only the metadata being different. So you would have the following semantics:

  • latin1_s = s.encode('latin-1'): this yields a binary view and allocates new memory.
  • utf8_s = s.encode('utf-8'): this is a no-op, but yields a binary view.
  • s2 = utf8_s.decode('utf-8'): this requires running the data through a Unicode codec to validate it against the indicated encoding (see the sketch below).
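
A minimal sketch of that validation pass (structural checks only; a real codec would also reject overlong encodings, surrogates, and code points above U+10FFFF):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if buf[0..len) is structurally valid UTF-8 */
static bool utf8_valid(const uint8_t* buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t nbytes;
        if (b < 0x80)                nbytes = 1;  /* ASCII */
        else if ((b & 0xE0) == 0xC0) nbytes = 2;
        else if ((b & 0xF0) == 0xE0) nbytes = 3;
        else if ((b & 0xF8) == 0xF0) nbytes = 4;
        else return false;           /* invalid lead byte */
        if (i + nbytes > len)
            return false;            /* truncated sequence */
        for (size_t j = 1; j < nbytes; j++)
            if ((buf[i + j] & 0xC0) != 0x80)
                return false;        /* not a continuation byte */
        i += nbytes;
    }
    return true;
}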

Indexing and slicing

Storing strings as UTF-8 bytes means that things like this become more complicated:

In [7]: s = pd.Series(["estás está estáis"])

In [8]: s.str[9]
Out[8]: 
0    á
dtype: object

In [9]: s.str[6:10]
Out[9]: 
0    está
dtype: object

Since UTF-8 is a variable-length encoding, finding a logical character by position will need to make use of the Python C API (expensive, as it requires creating new Python objects) or a third-party library; we could make use of the ICU C++ libraries to implement this.
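
Short of pulling in ICU, the core operation is an O(n) scan: UTF-8 continuation bytes all match 10xxxxxx, so the n-th code point is found by counting lead bytes rather than indexing directly. A sketch (utf8_codepoint_offset is an illustrative helper):

#include <stddef.h>
#include <stdint.h>

/* Returns the byte offset of the n-th code point in a UTF-8 buffer,
   or len if the string has fewer than n+1 code points. Continuation
   bytes (10xxxxxx) are skipped when counting. */
static size_t utf8_codepoint_offset(const uint8_t* buf, size_t len, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80) {  /* lead byte: new code point */
            if (count == n)
                return i;
            count++;
        }
    }
    return len;
}

For 'estás está estáis', s.str[9] would resolve to byte offset 10, since each of the two earlier 'á' characters occupies two bytes in UTF-8.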