Posts tagged pandas

Dask performance benchmarking put to the test: Fixing a pandas bottleneck

Getting notified of a significant performance regression the day before release sucks, but quickly identifying and resolving it feels great!

We were getting set up at our booth at JupyterCon 2023 when we received a notification: an engineer on our team had spotted a significant performance regression in Dask. It increased runtime by 40% and blocked the release planned for the next day!

Read more ...


Shuffling Large Data at Constant Memory in Dask | Dask Demo Day 2023-03

Debugging is hard. Distributed debugging is hell.

Dask is a popular library for parallel and distributed computing in Python. In this demo, we showcase the recent scalability and performance improvements in the dask.dataframe API that were enabled by my work on the new P2P shuffling system.

Read more ...