Monday 15:30 in room 1.38 (ground floor)

Compress, Compute, and Conquer: Python-Blosc2 for Efficient Data Analysis

Francesc Alted, Luke Shaw

Blosc and Blosc2 are well-known and widely used libraries for high-performance data compression. They are particularly effective for compressing large datasets, such as those encountered in data science and high-performance computing. The Blosc library has been around for more than a decade, and its design has always prioritized speed, aiming for compression and decompression speeds that approach or even exceed memory bandwidth limits.

With the introduction of a new compute engine in Python-Blosc2 3.0, the guiding principle has evolved to "Compress Better, Compute Bigger." This enhancement enables computations on datasets that are over 100 times larger than the available RAM, all while maintaining high performance.
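The idea of computing on data that stays compressed can be sketched in a few lines. The example below is only an illustration of the general principle and does not use the Python-Blosc2 API: it swaps in the standard-library zlib codec, and the names compress_chunks, chunked_sum and CHUNK_ITEMS are hypothetical. The data is split into chunks, each chunk is compressed on its own, and a reduction then decompresses one chunk at a time, so the full uncompressed dataset is never needed in memory during the computation.

```python
import zlib
from array import array

# Hypothetical chunk size, chosen only for illustration.
CHUNK_ITEMS = 100_000


def compress_chunks(values, chunk_items=CHUNK_ITEMS):
    """Split a sequence of floats into chunks and compress each one separately."""
    chunks = []
    for start in range(0, len(values), chunk_items):
        raw = array("d", values[start:start + chunk_items]).tobytes()
        # zlib stands in for Blosc2 here; level 1 favours speed over ratio.
        chunks.append(zlib.compress(raw, 1))
    return chunks


def chunked_sum(chunks):
    """Sum all values while holding only one decompressed chunk at a time."""
    total = 0.0
    for blob in chunks:
        block = array("d")
        block.frombytes(zlib.decompress(blob))
        total += sum(block)
    return total


# Synthetic, highly repetitive data so that it compresses well.
values = [float(i % 100) for i in range(1_000_000)]
chunks = compress_chunks(values)

raw_bytes = len(values) * 8  # 8 bytes per double
compressed_bytes = sum(len(c) for c in chunks)
total = chunked_sum(chunks)

print(f"compression ratio: {raw_bytes / compressed_bytes:.1f}x")
print(f"sum over compressed chunks: {total}")
```

In this sketch the input list is still built in memory first; a real out-of-core workflow would stream chunks from disk or generate them incrementally.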

In this hands-on tutorial, participants will learn how to effectively use Python-Blosc2 through practical exercises divided into four sections:

Section 1: Getting Started with Python-Blosc2 (20 minutes)

Section 2: Integration with NumPy and the Python Data Ecosystem (20 minutes)

Section 3: The Compute Engine (30 minutes)

Section 4: Advanced Usage and Real-world Applications (20 minutes)

Throughout the tutorial, we'll work with practical examples demonstrating how to analyze datasets that exceed available RAM without specialized hardware. By the end, participants will have hands-on experience implementing Python-Blosc2 in data workflows and will understand how to compress data while maintaining computational efficiency.

This tutorial will help you expand your capabilities for scientific computing and data analysis while reducing memory footprint and improving processing speed. Attendees should bring laptops with Python installed; pre-tutorial setup instructions will be provided.

Francesc Alted

I am a curious person who studied Physics and Applied Maths. I spent over a year at CERN for my MSc in High Energy Physics. However, I found maths and computer science equally fascinating, so I left academia to pursue these fields. Over the years, I developed a passion for handling large datasets and using compression to enable their analysis on commodity hardware accessible to everyone.

I am the CEO of ironArray SLU and also lead the Blosc Development Team. I am very excited to be working on an easy and effective way to share Blosc2 datasets over the network via Caterva2 and Cat2Cloud, a software-as-a-service offering that we are introducing.

As an Open Source believer, I started the PyTables project more than 20 years ago. Over 25 years in this business, I have started several other useful open source projects such as Blosc, Caterva2 and Btune; those efforts won me two prizes that mean a lot to me.

You can learn more about what I am working on by reading my latest blog posts.

Luke Shaw

2019 BS in Physics (Princeton University), cum laude

2020 MSc in Applied Mathematics (University of Edinburgh), with distinction

2024 PhD in Applied Mathematics (Universitat Jaume I), sobresaliente cum laude