Processing Cloud-optimized data in Python (Dataplug)
Universitat Rovira i Virgili (Pedro Garcia Lopez), Daniel Alejandro Coll Tejeda
Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud object storage without needing to download the entire dataset. These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data.
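To make "fetch only the necessary subsets" concrete, here is a minimal sketch that simulates an object in cloud storage with an in-memory byte string. The `get_range` helper and the chunk index layout are hypothetical stand-ins for a ranged GET request and a format's internal index; they do not belong to any particular library.

```python
import struct

# Simulated object: four chunks of 3 little-endian float32 values each.
CHUNK_VALUES = 3
chunks = [[float(10 * c + i) for i in range(CHUNK_VALUES)] for c in range(4)]
blob = b"".join(struct.pack("<3f", *chunk) for chunk in chunks)

# Hypothetical chunk index: chunk id -> (byte offset, byte length).
chunk_size = CHUNK_VALUES * 4
index = {c: (c * chunk_size, chunk_size) for c in range(len(chunks))}

bytes_transferred = 0


def get_range(offset, length):
    """Stand-in for a ranged GET against object storage."""
    global bytes_transferred
    data = blob[offset:offset + length]
    bytes_transferred += len(data)
    return data


def read_chunk(chunk_id):
    """Look up one chunk in the index and fetch only its bytes."""
    offset, length = index[chunk_id]
    return list(struct.unpack("<3f", get_range(offset, length)))


subset = read_chunk(2)
print(subset, bytes_transferred, "of", len(blob), "bytes")
```

Only the bytes of the requested chunk are transferred; the rest of the object is never read.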
They also enable efficient parallel processing through on-the-fly partitioning, which can considerably accelerate data management operations. For example, Dask can read data stored in CO formats such as Zarr in parallel, directly from object storage.
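The partition-then-process pattern can be sketched with the standard library alone. In this illustrative example, a thread pool stands in for a distributed scheduler such as Dask, and a Python list stands in for a chunked dataset in object storage. The `make_partitions` and `partial_sum` names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(total_items, num_partitions):
    """Split range(total_items) into contiguous, near-equal (start, stop) slices."""
    base, extra = divmod(total_items, num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        stop = start + base + (1 if i < extra else 0)
        partitions.append((start, stop))
        start = stop
    return partitions


# Stands in for a chunked dataset stored in object storage.
data = list(range(1, 101))


def partial_sum(bounds):
    """Each worker reads and reduces only its own slice."""
    start, stop = bounds
    return sum(data[start:stop])


# Partitions are computed at job time; the stored data is not rewritten.
partitions = make_partitions(len(data), 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum, partitions))

total = sum(partial_results)
print(partitions, total)
```

Because the partitions are plain index ranges, the number of workers can change from one job to the next without touching the stored data.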
Cloud-optimized formats are now widely used in geospatial settings, with entire datasets, such as the Sentinel-2 Cloud-Optimized GeoTIFFs, available in the Registry of Open Data on AWS. In this line, COPC (Cloud Optimized Point Cloud) was developed to overcome the limitations of legacy LIDAR point cloud formats (LAS/LAZ). Likewise, Cloud Optimized GeoTIFF (COG) was developed to facilitate cloud processing of GeoTIFF files.
Nevertheless, there are no cloud-optimized versions of widely used formats in genomics (FASTA, FASTQ, VCF, FASTQGZIP) and metabolomics (imzML). Furthermore, existing cloud-optimized formats require costly preprocessing of legacy data (from GeoTIFF to COG, from LIDAR to COPC). In this talk, we will present Dataplug, a novel data processing library that enables cloud-optimized access to legacy formats without costly preprocessing or large data movements. Dataplug covers legacy formats like LIDAR as well as major data formats in bioinformatics (genomics, metabolomics) that lack appropriate cloud-optimized alternatives.
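As a rough illustration of how a legacy text format can be partitioned without converting it, the sketch below builds a small index of record offsets for a FASTA file in a single pass and then derives byte ranges aligned to record boundaries. This is a generic technique shown under simplifying assumptions (a tiny in-memory file, hypothetical helper names). It is not the Dataplug API and makes no claim about Dataplug's internals.

```python
import io

# Tiny FASTA file held in memory; in practice this would live in object storage.
fasta = (
    b">seq1\nACGT\nACGT\n"
    b">seq2\nGGCC\n"
    b">seq3\nTTAA\nTT\n"
    b">seq4\nCCCC\n"
)


def build_offset_index(stream):
    """One sequential pass: record the byte offset of every '>' header line."""
    offsets, position = [], 0
    for line in stream:
        if line.startswith(b">"):
            offsets.append(position)
        position += len(line)
    return offsets, position  # record start offsets and total size in bytes


def partition_records(offsets, total_size, num_partitions):
    """Group whole records into byte ranges so that no record is split."""
    per_part = -(-len(offsets) // num_partitions)  # ceiling division
    ranges = []
    for first in range(0, len(offsets), per_part):
        last = first + per_part
        end = offsets[last] if last < len(offsets) else total_size
        ranges.append((offsets[first], end))
    return ranges


offsets, size = build_offset_index(io.BytesIO(fasta))
ranges = partition_records(offsets, size, 2)

# Each range could be fetched independently with a ranged read.
parts = [fasta[start:end] for start, end in ranges]
print(offsets, ranges)
```

The original file is left untouched. Only the small offset index is new, and it can be reused to produce a different number of partitions for each job.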
You will learn how to process scientific data formats in Python using the Dataplug library from any Python data analytics platform, such as Dask or Ray. We will show different cloud data processing pipelines that demonstrate the benefits of cloud-optimized data management.
Objectives
By the end of this talk, you will be able to:
- Understand Cloud-Optimized data formats and their benefits for data processing in the Cloud
- Process Cloud-Optimized data from Object Storage in Python using Dask
- Use the Dataplug library to enable on-the-fly partitioning of Cloud-Optimized data (COG, Zarr, COPC)
- Use the Dataplug library to enable on-the-fly partitioning of non-Cloud-Optimized formats (LIDAR, FASTQGZIP, FASTA, FASTQ, VCF, imzML, MS)
Outline
Introduction (10 minutes)
- About Us
- Understanding Cloud-Optimized data formats and Cloud Object storage
- Processing Cloud-Optimized data in Dask
Processing Cloud-optimized data in the Cloud with Python (15 minutes)
- Processing COG (Cloud-Optimized GeoTIFFs) in Python in the NDVI pipeline
- On-the-fly processing of compressed genomic data (FASTQGZIP) with Dataplug
- On-the-fly processing of metabolomics data (imzML) with Dataplug
- Comparing LIDAR and COPC processing with the Dataplug library in Dask (code)
Conclusions (2 minutes)
Audience
The talk is aimed at Python developers interested in processing data in the Cloud. In particular, it may be of interest in the following domains: geospatial data (COG, COPC, LIDAR, Zarr, Kerchunk), genomics data (FASTA, FASTQ, VCF, FASTQGZIP), astronomy data (MS), and metabolomics data (imzML). This talk requires a basic understanding of Cloud Object Storage.