.. _strings:

.. ipython:: python
   :suppress:

   import numpy as np
   import pandas as pd
   np.set_printoptions(precision=4, suppress=True)
   pd.options.display.max_rows = 100

=============================================
Internals: Enhanced string / UTF-8 handling
=============================================

There are some things we can do to make pandas use less memory and perform
computations significantly faster on string data.

Current string problems
=======================

pandas offers support for columns containing strings (ASCII or Unicode) on a
somewhat ad hoc basis.

* Strings are stored in NumPy arrays of ``PyObject*`` / ``numpy.object_``
  dtype. This has several problems:

  * Computations (e.g. ``groupby`` operations) typically utilize a code path
    for generic Python objects. For example, comparisons and hashing go
    through the ``PyObject_*`` C API functions. In addition to harming
    multithreading due to GIL contention (you must acquire the GIL to use
    these functions), these can also be significantly slower than algorithms
    that operate on ``const char*``, potentially taking advantage of hardware
    optimizations.

  * String arrays often feature many copies of or references to the same
    PyString. Thus, some algorithms may perform redundant computation. Some
    parts of pandas, like ``pandas.read_csv``, make an effort to deduplicate
    strings to free memory and accelerate computations (e.g. if you do
    ``x == y``, and ``x`` and ``y`` are references to the same ``PyObject*``,
    Python skips comparing their internal data).

    * Note that this is somewhat mitigated by using ``pandas.Categorical``,
      but this is not the default storage mechanism. More on this below.

  * Using ``PyString`` objects and ``PyObject*`` NumPy storage adds
    non-trivial overhead (52 bytes in Python 3, slightly less in Python 2,
    see `this exposition `_ for a deeper dive) to each value; the short
    demonstration after this list makes the overhead concrete.
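A rough illustration of that overhead, assuming a typical 64-bit CPython 3
build (the exact byte counts are version-dependent):

.. code-block:: python

   import sys
   import numpy as np

   # Each str instance carries an object header, hash cache, length, and
   # flags before any character data (roughly 49 bytes on a typical 64-bit
   # CPython 3 build; exact numbers vary by version).
   sys.getsizeof("")        # ~49
   sys.getsizeof("pandas")  # ~55: the overhead plus one byte per ASCII char

   # A NumPy object array stores only 8-byte PyObject* pointers; the string
   # payloads (and their per-object overhead) live elsewhere on the heap,
   # and repeated values may or may not share a single object.
   arr = np.array(["foo", "bars", "foo"], dtype=object)
   arr.nbytes               # 24: three pointers, excluding the strings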
Possible solution: new non-NumPy string memory layout
=====================================================

My proposed solution to the string conundrum is the following:

* Create a custom string array container type suitable for use in a
  ``pandas.Array``, and a ``pandas.string`` logical data type.
* Require that all strings be encoded as UTF-8.
* By default, represent all string arrays internally as dictionary-encoded
  a.k.a. categorical. Thus, we will typically only ever have 1 copy of any
  given string in an array.
* Store the actual string data in a packed UTF-8 buffer. I have seen this in
  a number of places, but notably it's the way that `Apache Arrow implements
  variable-length collections `_.

Here is one possible C struct-like layout of this container:

.. code-block:: c++

   typedef struct {
     /* Category / dictionary indices into the string data */
     uint32_t* indices;

     /* Offsets delimiting each distinct value in the packed string data */
     uint32_t* offsets;

     /* The packed UTF-8 data */
     const char* data;

     /* For nullness */
     uint8_t* bitmap;
   } string_array_t;

Here's an example of what the data would look like:

.. code-block:: text

   actual data: ['foo', 'bars', 'foo', null, 'bars']

   indices: [0, 1, 0, 0, 1]

   bitmap[0], read right-to-left (bit i is 1 when value i is non-null):

     0 0 0 1 0 1 1 1

   offsets: [0, 3, 7]

   data: ['f', 'o', 'o', 'b', 'a', 'r', 's']
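To make the interplay of the four buffers concrete, here is a minimal
pure-Python sketch that decodes the example above (lists and ``bytes`` stand
in for the C arrays, and the ``get_value`` helper is hypothetical, purely for
illustration):

.. code-block:: python

   indices = [0, 1, 0, 0, 1]   # dictionary index per slot
   offsets = [0, 3, 7]         # value j spans data[offsets[j]:offsets[j + 1]]
   data    = b"foobars"        # packed UTF-8 buffer
   bitmap  = 0b00010111        # bit i set => slot i is non-null

   def get_value(i):
       """Hypothetical helper: materialize slot i as a Python str or None."""
       if not (bitmap >> i) & 1:
           return None
       j = indices[i]
       # In C, this is where PyUnicode_FromStringAndSize would be called.
       return data[offsets[j]:offsets[j + 1]].decode('utf-8')

   [get_value(i) for i in range(5)]
   # ['foo', 'bars', 'foo', None, 'bars']

Note that ``offsets`` has one more entry than the number of distinct values,
so the length of value ``j`` is always ``offsets[j + 1] - offsets[j]``.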
Some benefits of this approach include:

* Much better data locality for low-cardinality categorical data.
* 8.125 bytes (8 bytes plus 1 bit) of memory overhead per value, versus the
  current 33 to 52 bytes.
* The data is already categorical: a cast to ``category`` dtype can be
  performed very cheaply and without duplicating the underlying string memory
  buffer.
* Computations like ``groupby`` on dictionary-encoded strings will be as
  performant as they currently are on ``Categorical``.

Some drawbacks:

* This memory layout is best used as an immutable representation, as mutating
  slots becomes more complex: either single value assignments or put /
  array-assignments would likely require constructing a new ``data`` buffer
  (either by ``realloc`` or some other copying mechanism). Without a
  compaction / "garbage collection" step on this buffer it will be possible
  to have "dead" memory inside it (for example, if you did
  ``arr[:] = 'a-new-string-value'``, all the existing values would be
  orphaned).

  * Some systems have addressed this issue by storing all string data in a
    "global string hash table". This is something we could explore, but it
    would add quite a bit of complexity to implement and may not be
    worthwhile at this time.

* Indexing into this data structure to obtain a single Python object will
  probably want to call ``PyUnicode_FromStringAndSize`` to construct a string
  (Python 3, therefore Unicode). This requires a memory allocation, whereas
  it currently only has to do a ``Py_INCREF``.

* Many of pandas's existing algorithms that assume Python objects would need
  to be specialized to take advantage of this new memory layout. This is both
  a pro and a con, as the specialized versions will most likely yield
  significantly better performance.

Concerns / problems
===================

Preserving code that assumes PyString objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any alternate UTF-8 string in-memory representation should necessarily be
able to yield Python string objects using ``PyUnicode_FromStringAndSize``.
Thus, code like this could continue to work:

.. ipython:: python

   s = pd.Series(["como estás?"])
   s.map(lambda x: x.upper())

One trade-off is that creating the temporary Python strings is potentially
costly. This cost could be mitigated for Python ``str`` methods (via an
optimized array-oriented code path under the hood), but for arbitrary
functions you would have to pay it.

Accommodating Non-UTF-8 data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some pandas users will have code that involves various non-UTF-8 Python
string types:

* Native Unicode: ``Py_UCS1``, ``Py_UCS2``, ``Py_UCS4``
* Non-UTF-8 ``PyBytes``

.. ipython:: python

   s = pd.Series(["como estás?"])
   s
   s.str.encode('latin-1')
   s.str.encode('latin-1').str.decode('latin-1')

Such data could arise from reading a CSV file in a non-UTF-8 encoding without
indicating that encoding to ``pandas.read_csv``.

My proposed solution to this is to provide a ``binary`` logical type having
the same physical memory layout as UTF-8 strings, with only the metadata
being different. So you would have the following semantics:

* ``latin1_s = s.encode('latin-1')``: this yields a ``binary`` view and
  allocates new memory.
* ``utf8_s = s.encode('utf-8')``: this is a no-op, but yields a ``binary``
  view.
* ``s2 = utf8_s.decode('utf-8')``: this requires running the data through a
  Unicode codec to validate it against the indicated encoding.

Indexing and slicing
~~~~~~~~~~~~~~~~~~~~

Storing strings as UTF-8 bytes means that things like this become more
complicated:

.. ipython:: python

   s = pd.Series(["estás está estáis"])
   s.str[9]
   s.str[6:10]

Since UTF-8 is a variable-length encoding, finding the logical character at a
given position will need to make use of the Python C API (expensive, requires
creating new Python objects) or a 3rd party library. We could make use of the
`ICU C++ Libraries `_ to implement this.
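To illustrate why positional access is no longer O(1), here is a minimal
pure-Python sketch of the scan involved (the ``utf8_char_at`` helper is
hypothetical, for illustration only); it relies on the fact that UTF-8
continuation bytes always match the bit pattern ``10xxxxxx``:

.. code-block:: python

   def utf8_char_at(buf, n):
       """Hypothetical helper: return the n-th code point in UTF-8 bytes.

       Runs in O(len(buf)) because code points span 1 to 4 bytes, unlike
       O(1) indexing into a fixed-width representation.
       """
       count = -1
       for pos, byte in enumerate(buf):
           if byte & 0xC0 != 0x80:        # a lead byte starts a code point
               count += 1
               if count == n:
                   end = pos + 1          # absorb continuation bytes
                   while end < len(buf) and buf[end] & 0xC0 == 0x80:
                       end += 1
                   return buf[pos:end].decode('utf-8')
       raise IndexError(n)

   utf8_char_at("estás está estáis".encode('utf-8'), 9)  # 'á'

A production implementation would do this scan in C or delegate to a library
such as ICU, but the linear-time lookup is inherent to the variable-width
encoding.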