Thursday 11:05
in room 1.19 (ground floor)
Efficient processing pipelines for large scale molecular datasets in Python
Franciszek Job
Problem Statement Public molecular collections are fragmented across multiple repositories, e.g. PubChem, UniChem, or COCONUT. Each has its own data representation, inclusion criteria, and other idiosyncrasies. This makes it difficult to assemble a single, high-quality dataset, e.g. for pretraining molecular foundation models aimed at varied chemoinformatics applications.
Software Our framework accepts one or more raw chemical datasets as input; each dataset is routed to its own pipeline (described below). The output is a single, cleaned master dataset.
Pipeline Overview Each pipeline consumes raw data from its source and executes a sequence of steps:
- Download: fetch original files
- Preprocess: parse and normalize input formats
- Standardize: sanitize and canonicalize SMILES via RDKit
- Deduplicate: remove duplicates using InChI keys (per source, then globally)
The final output is a merged “master” dataset ready for downstream analysis or modeling.
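A minimal sketch of the standardize and deduplicate steps, assuming each record arrives as a SMILES string; the function names are illustrative and not the framework's actual API.

```python
from typing import Optional

from rdkit import Chem


def standardize(smiles: str) -> Optional[str]:
    """Sanitize a SMILES string and return its canonical form, or None if unparseable."""
    mol = Chem.MolFromSmiles(smiles)  # sanitization happens during parsing
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=True)


def deduplicate(smiles_list: list) -> list:
    """Keep the first occurrence of each molecule, keyed by its InChI key."""
    seen = set()
    unique = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        key = Chem.MolToInchiKey(mol)
        if key not in seen:
            seen.add(key)
            unique.append(smi)
    return unique
```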
Technologies & Tools
- Dask - a framework for distributed and parallel computing. The pipelines rely heavily on Python libraries, so we need a tool that scales Python workloads across many worker processes well (see the sketch after this list).
- RDKit - for chemical operations. Because its core is written in C++, these operations are high-performance.
- scikit-fingerprints - for molecular filters
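As a rough illustration of how Dask fits in, the per-molecule work can be mapped over a Dask bag and deduplicated per partition set; the sample data, partition count, and helper function are placeholders rather than the pipeline's actual configuration.

```python
import dask.bag as db
from rdkit import Chem


def to_canonical(smiles):
    """Parse, sanitize, and canonicalize a single SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


# Stand-in for records downloaded from one source.
raw_smiles = ["CCO", "C1=CC=CC=C1", "not-a-smiles"]

bag = db.from_sequence(raw_smiles, npartitions=4)
canonical = (
    bag.map(to_canonical)
       .filter(lambda smi: smi is not None)  # drop unparseable records
       .distinct()                           # per-source deduplication
)
print(canonical.compute())
```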
Intended Uses This software enables users to:
- Pretrain large ML models on unified chemical space
- Compare dataset cardinality, functional-group distributions, Bemis–Murcko scaffolds, and Circles metrics
- Benchmark self-supervised architectures (e.g., Mol2Vec, MolFormer) across varied dataset sizes and chemical domains
- Integrate new data sources or custom filters as requirements evolve
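For example, scaffold comparisons of the kind listed above can start from RDKit's Murcko-scaffold utilities; the counting logic below is a sketch, not part of the framework.

```python
from collections import Counter

from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_distribution(smiles_list):
    """Count Bemis-Murcko scaffolds across a dataset of SMILES strings."""
    counts = Counter()
    for smi in smiles_list:
        try:
            scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        except Exception:
            continue  # skip molecules RDKit cannot parse
        counts[scaffold] += 1
    return counts


print(scaffold_distribution(["c1ccccc1CCN", "c1ccccc1O", "CCO"]))
```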