Pandas 1.0 is faster and more efficient: here are the top features that will blow your mind.

Sabber Ahamed
3 min read · Jan 13, 2020

Pandas is an open-source Python library that we use in our data science development work. It provides easy-to-use data structures and data analysis tools. Although Pandas is not the fastest data processing library, it is still one of the tools we rely on for daily data processing and analysis.

Pandas published its first 1.0 release candidate (1.0.0rc0) on Jan 09, 2020. I immediately liked some of the features, especially the performance improvements in several frequently used methods. In this article, I will discuss the top three methods (correlation, replace, and select_dtypes) whose performance improved significantly.

Correlation: The most fascinating improvement is in the rank-based pairwise correlation of columns:

DataFrame.corr(self, method='spearman', min_periods=1)

Pearson linear correlation was already well optimized; however, I did not like the “spearman” correlation, which was painfully slow on larger datasets. That is why I used to compute the ranks first and then compute the Pearson correlation on them. One of the contributors to the library (@WillAyd) mentioned that the ranking was being recalculated redundantly inside a nested for loop within nancorr_spearman. In this release candidate, the authors fixed the problem, so the performance should be better now. It would be nice to see other non-parametric correlation algorithms (e.g., ACE, density-based, etc.) in future releases.
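The rank-then-Pearson workaround described above can be sketched as follows. This is a minimal illustration on made-up random data (the column names and sizes are my own assumptions), and the two results only agree like this when the data has no missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: five numeric columns, no missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
df["e"] = df["a"] ** 3 + rng.normal(scale=0.1, size=len(df))

# Direct rank-based correlation.
spearman_direct = df.corr(method="spearman")

# Workaround: rank every column once, then run Pearson on the ranks.
spearman_via_rank = df.rank().corr(method="pearson")

# Largest absolute difference between the two correlation matrices.
max_diff = (spearman_direct - spearman_via_rank).abs().to_numpy().max()
print(max_diff)
```

The difference should be at floating-point noise level, which is why ranking each column once up front is a valid shortcut on complete data.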

Replace: The second most exciting performance improvement is in the replace method of a DataFrame. The method replaces the values given in to_replace with value.

DataFrame.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')

The values to replace can be given as a scalar, list, dict, or regular expression. Pandas contributor jbrockmendel helped to identify the bottleneck and made the method faster. The following snippet is the contributor’s own benchmark; for each call, the first timing is from the master branch and the second is from the pull request with the fix:

In [2]: df = pd.DataFrame({"A": 0, "B": 0}, index=range(4*10**7))
In [3]: %timeit df.replace([np.inf, -np.inf], np.nan)
5.18 s ± 423 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # <-- master
414 ms ± 7.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # <-- PR
In [4]: %timeit df.replace([np.inf, -np.inf], np.nan, inplace=True)
2.89 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # <-- master
69.6 µs ± 4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # <-- PR
In [5]: %timeit df.replace([np.inf, -np.inf, 1], np.nan)
4.88 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # <-- master
466 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # <-- PR
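For readers who have not used replace this way, here is a small sketch of the same call on a tiny frame. The data is invented for illustration; only the replace call itself mirrors the benchmark above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame containing positive and negative infinity.
df = pd.DataFrame({"A": [1.0, np.inf, 3.0], "B": [-np.inf, 5.0, 6.0]})

# Replace both infinities with NaN; a new DataFrame is returned
# and the original is left untouched (inplace defaults to False).
cleaned = df.replace([np.inf, -np.inf], np.nan)

n_missing = int(cleaned.isna().sum().sum())
n_inf_left_in_original = int(np.isinf(df.to_numpy()).sum())
print(cleaned)
```

Turning infinities into NaN like this is a common cleaning step before calling methods such as dropna or fillna.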

Select data types: Another performance boost is in the select_dtypes method, which returns a subset of the DataFrame's columns based on their data types.

DataFrame.select_dtypes(self, include=None, exclude=None)

Although the time complexity of the method is O(n), it was slow when a dataset had a massive number of features (tens of thousands). datajanko pointed out that inferring the data type of each feature took most of the time. The authors dug into the code to pinpoint the lines causing the problem and found that it could be fixed by using vectorization instead of iterating in a loop. So they fixed it. Yay!!
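As a quick illustration of what select_dtypes does, here is a minimal sketch. The column names and values are made up for the example:

```python
import pandas as pd

# Hypothetical frame with integer, float, string, and boolean columns.
df = pd.DataFrame({
    "age": [31, 45, 27],
    "height": [1.72, 1.80, 1.65],
    "name": ["Ana", "Ben", "Cy"],
    "member": [True, False, True],
})

# Keep only the numeric columns (booleans do not count as "number").
numeric = df.select_dtypes(include="number")

# Keep everything except the numeric columns.
non_numeric = df.select_dtypes(exclude="number")

print(list(numeric.columns))
print(list(non_numeric.columns))
```

On a frame with tens of thousands of columns, this per-column type check is the step that the fix described above speeds up.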

I think this new version of Pandas is excellent, and it contains lots of other great features:

  1. Experimental NA scalar to denote missing values
  2. Dedicated string data type
  3. Using Numba in rolling.apply
  4. Converting to Markdown

and many more. You can upgrade to the new version and try it out using pip:

pip install --upgrade pandas==1.0.0rc0

More details of this release version can be found on the Pandas website: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html
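Two of the features listed above, the experimental NA scalar and the dedicated string data type, can be tried in a few lines. This is only a small sketch with invented values, and since both features were marked experimental in this release, their behavior may change in later versions:

```python
import pandas as pd

# A Series using the dedicated string data type; None becomes a missing value.
names = pd.Series(["ana", None, "cy"], dtype="string")

missing_mask = names.isna()   # which entries are missing
upper = names.str.upper()     # string methods skip the missing entry

# pd.NA follows three-valued logic: the result is only "unknown" (NA)
# when the other operand does not already decide the outcome.
unknown_or_true = pd.NA | True
unknown_and_false = pd.NA & False

print(missing_mask.tolist())
print(upper.tolist())
```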

I would love to hear what you think about the improvements, so please leave a comment and let’s discuss.

Email: sabbers@gmail.com
LinkedIn: https://www.linkedin.com/in/sabber-ahamed/
Github: https://github.com/msahamed
Medium: https://medium.com/@sabber/
