Hey, I’m Hendrik!

I am a software and data engineer building systems at the intersection of large-scale data management and machine learning. Currently, I work as an Open Source Engineer at Coiled, where I maintain and improve Dask and its distributed execution engine.

To learn more about me, click here.

Feel free to reach out if you work on a compelling idea, company, or open-source project in the space of large-scale data or machine learning systems, or just want to have a chat!

Recent blog posts

  • Dask performance benchmarking put to the test: Fixing a pandas bottleneck - 2023-06-23

    Getting notified of a significant performance regression the day before release sucks, but quickly identifying and resolving it feels great!

    We were getting set up at our booth at JupyterCon 2023 when we received a notification: an engineer on our team had spotted a significant performance regression in Dask. With a 40% increase in runtime, it blocked the release planned for the next day!

  • Observability for Distributed Computing with Dask - 2023-05-16

    Debugging is hard. Distributed debugging is hell.

    When dealing with unexpected issues in a distributed system, you need to understand what happened and why, how interactions between individual pieces contributed to the problem, and how to avoid it in the future. In other words, you need observability. This article explains what observability is, how Dask implements it, what pain points remain, and how Coiled helps you overcome them.

    The Coiled metrics dashboard provides observability into a Dask cluster and its workloads.
  • Shuffling large data at constant memory in Dask - 2023-03-15

    With release 2023.2.1, dask.dataframe introduces a new shuffling method called P2P, which makes sorts, merges, and joins faster while using constant memory. Benchmarks show impressive improvements:

    P2P shuffling uses constant memory while task-based shuffling scales linearly.
  • Personalization versus ‘Filter Bubble’: The Influence of Personalization on the Quality of Search Queries - 2018-10-17

    With the accelerating speed of data generation, it becomes increasingly important to develop and improve techniques that help us find the most relevant information for any given question. Personalization is a satisfactory solution for many use cases, such as on-site search in e-commerce or many queries on general-purpose search engines. For certain queries, however, such as those on controversial political topics, result diversity is important to diminish the negative effects of personalization on civic discourse and democracy.