Other miscellaneous ideas¶

Dropping Python 2 support¶

With Python 2.7 reaching its supported end-of-life in 2020, we should seriously consider following other Python projects (e.g. IPython / Jupyter) and making pandas 2.0 support only Python 3.5 and higher. In addition to lowering the development burden at both the C API and pure Python level, we could finally take advantage of features that are only available in Python 3 (things like asyncio, maybe?).

Deprecated code to remove¶

.ix indexing entirely
Panel and PanelND classes
Plotting?

Other ideas¶

Here’s a collection of other miscellaneous ideas that don’t necessarily fit elsewhere in these documents.

Column statistics¶

In quite a few pandas algorithms, there are characteristics of the data that are very useful to know, such as:

Monotonicity: for comparable data (e.g. numbers), is the data sorted / strictly increasing? In time series, this permits sorting steps to be skipped.
Null count: for data without any nulls, the null-handling path in some algorithms can be skipped entirely.
Uniqueness: used in indexes, and can be helpful elsewhere
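As a rough illustration of the idea, the sketch below caches these three properties for a column and uses them to skip work. It is pure Python and not a pandas API: the names `ColumnStats`, `compute_stats`, and `sorted_values` are hypothetical, and nulls are modelled as `None` purely for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical container for cached per-column facts (not a pandas API)."""
    is_monotonic_increasing: bool
    null_count: int
    is_unique: bool


def compute_stats(values: Sequence[Optional[float]]) -> ColumnStats:
    # Nulls are modelled as None here, purely for illustration.
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)
    # Assumption: a column containing nulls is never reported as monotonic.
    monotonic = null_count == 0 and all(
        a <= b for a, b in zip(non_null, non_null[1:])
    )
    unique = len(set(non_null)) == len(non_null)
    return ColumnStats(monotonic, null_count, unique)


def sorted_values(values: Sequence[Optional[float]], stats: ColumnStats) -> List[float]:
    # Known-monotonic data (which implies no nulls here) needs no sort at all.
    if stats.is_monotonic_increasing:
        return list(values)
    # Otherwise take the slow path: drop nulls, then sort.
    return sorted(v for v in values if v is not None)


stats = compute_stats([1.0, 2.0, 2.0, 5.0])
print(stats)
print(sorted_values([1.0, 2.0, 2.0, 5.0], stats))
```

The design question this leaves open is when such statistics are computed and invalidated; the sketch computes them eagerly and treats the column as immutable.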

Strided arrays: more trouble than they are worth?¶

Per the general discussion around changing DataFrame’s internals to hold a list / std::vector of arrays, for me this raises the question of whether there is any benefit in continuing to accommodate strided one-dimensional data.

Some pros for eliminating strided data completely:

Guaranteeing contiguous memory internally will yield more consistent and predictable performance.
Not needing to consider a stride different from 1 means simpler low-level array indexing code (e.g. you can work with plain C arrays). The stride is a complexity / overhead that leaks to every algorithm that iterates over an array.
You avoid strange situations where a small strided view holds a reference to a much larger base ndarray and keeps it alive.
Example: https://github.com/wesm/feather/issues/97. Here, the internal orientation (column-major vs. row-major) is not clear to the user.
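The base-reference situation can be reproduced with plain NumPy. This is a minimal sketch; the array shape is arbitrary and chosen only to make the size difference visible.

```python
import numpy as np

# A C-ordered (row-major) 2-D block holding four float64 "columns".
block = np.zeros((100_000, 4), dtype=np.float64)

# Selecting one column yields a strided view, not a copy.
column = block[:, 0]

print(column.strides)                # (32,): consecutive elements are 4 float64 apart
print(column.flags["C_CONTIGUOUS"])  # False
print(column.base is block)          # True: the view keeps the whole block alive
print(block.nbytes, column.nbytes)   # 3200000 800000

# An explicit copy is contiguous and no longer refers to the block.
compact = np.ascontiguousarray(column)
print(compact.strides)               # (8,)
```

As long as `column` is reachable, the full 3.2 MB block cannot be freed, even though the view only exposes a quarter of it.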

Some cons:

It would not be possible to perform zero-copy computations on a strided NumPy array.
Relatedly, initializing a Series or DataFrame from strided memory would require allocating an equivalent amount of contiguous memory for each of the columns.
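To make the copying cost concrete, here is a small NumPy sketch. `to_contiguous_columns` is a hypothetical helper, not pandas code; it stands in for whatever a contiguous-only DataFrame constructor would have to do with 2-D input.

```python
import numpy as np


def to_contiguous_columns(data):
    """Hypothetical helper: split a 2-D array into contiguous 1-D columns.

    Columns that are already contiguous pass through without copying;
    strided columns are copied into newly allocated memory.
    """
    return [np.ascontiguousarray(data[:, i]) for i in range(data.shape[1])]


row_major = np.arange(12, dtype=np.float64).reshape(3, 4)  # C order
col_major = np.asfortranarray(row_major)                   # Fortran order

copied = to_contiguous_columns(row_major)
shared = to_contiguous_columns(col_major)

print(np.shares_memory(copied[0], row_major))  # False: a copy was required
print(np.shares_memory(shared[0], col_major))  # True: zero-copy
```

Row-major input pays for one allocation per column, while column-major input can still be wrapped without copying, so the cost depends on the orientation of the source data.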

Personally, I don’t find the cons compelling enough to justify the added code complexity.

Enforcing immutability in GroupBy functions¶

Side effects from groupby operations have been a common source of issues or unintuitive behavior for users.
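One possible way to enforce this, sketched here with NumPy rather than pandas, is to hand each group to the user function as a read-only view so that in-place writes fail loudly. `apply_by_group` and `demean_in_place` are hypothetical names for illustration, and the groups are assumed to be contiguous slices of the values array.

```python
import numpy as np


def apply_by_group(values, group_slices, func):
    """Hypothetical mini group-apply (not a pandas API).

    `group_slices` maps each key to a slice of `values`. Each group is
    passed to `func` as a read-only view, so an in-place write raises
    instead of silently modifying the caller's data.
    """
    results = {}
    for key, sl in group_slices.items():
        view = values[sl]             # a view, no copy
        view.flags.writeable = False  # affects this view only, not `values`
        results[key] = func(view)
    return results


def demean_in_place(group):
    group -= group.mean()  # attempts to mutate its input
    return group


values = np.array([1.0, 2.0, 3.0, 4.0])
group_slices = {"a": slice(0, 2), "b": slice(2, 4)}

# A pure reduction works as usual.
print(apply_by_group(values, group_slices, np.sum))

# A mutating function is rejected, and the original data is untouched.
try:
    apply_by_group(values, group_slices, demean_in_place)
except ValueError as exc:
    print("rejected:", exc)
print(values)
```

The alternative is to copy each group defensively, which permits mutation but hides it from the caller at the cost of extra allocations.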

Handling of sparse data structures¶

It’s possible that the sparse types could become first-class logical types, e.g. Sparse[T], which would eliminate the need for the separate Sparse* classes.