Thursday 11:05
in room 1.19 (ground floor)
Efficient processing pipelines for large scale molecular datasets in Python
Franciszek Job
Problem Statement Public molecular collections are fragmented across multiple repositories, e.g. PubChem, UniChem, or COCONUT. Each has its own data representation, inclusion criteria, and other idiosyncrasies. This makes it difficult to assemble a single, high-quality dataset, e.g. for pretraining molecular foundation models aimed at varied chemoinformatics applications.
Software Our framework accepts one or more raw chemical datasets as input; each dataset is routed to its own pipeline (described below). The output is a single, cleaned master dataset.
Pipeline Overview Each pipeline consumes raw data from its source and executes a sequence of steps:
- Download: fetch original files
- Preprocess: parse and normalize input formats
- Standardize: sanitize and canonicalize SMILES via RDKit
- Deduplicate: remove duplicates using InChI keys (per source, then globally)
The final output is a merged “master” dataset ready for downstream analysis or modeling.
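A minimal sketch of the standardize and deduplicate steps, assuming each record arrives as a SMILES string; the function names are illustrative and not the framework's actual API.

```python
from typing import Optional

from rdkit import Chem


def standardize(smiles: str) -> Optional[str]:
    """Sanitize a SMILES string and return its canonical form, or None if unparseable."""
    mol = Chem.MolFromSmiles(smiles)  # sanitization happens during parsing
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=True)


def deduplicate(smiles_list: list) -> list:
    """Keep the first occurrence of each molecule, keyed by its InChI key."""
    seen = set()
    unique = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        key = Chem.MolToInchiKey(mol)
        if key not in seen:
            seen.add(key)
            unique.append(smi)
    return unique
```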
Technologies & Tools
- Dask - a framework for distributed and parallel computing. The pipelines rely heavily on Python libraries, so we need a tool that scales Python workloads across many worker processes well (see the sketch after this list).
- RDKit - for chemical operations. Because its core is written in C++, these operations are high-performance.
- scikit-fingerprints - for molecular filters
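As a rough illustration of how Dask fits in, the per-molecule work can be mapped over a Dask bag and deduplicated per partition set; the sample data, partition count, and helper function are placeholders rather than the pipeline's actual configuration.

```python
import dask.bag as db
from rdkit import Chem


def to_canonical(smiles):
    """Parse, sanitize, and canonicalize a single SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


# Stand-in for records downloaded from one source.
raw_smiles = ["CCO", "C1=CC=CC=C1", "not-a-smiles"]

bag = db.from_sequence(raw_smiles, npartitions=4)
canonical = (
    bag.map(to_canonical)
       .filter(lambda smi: smi is not None)  # drop unparseable records
       .distinct()                           # per-source deduplication
)
print(canonical.compute())
```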
Intended Uses This software enables users to:
- Pretrain large ML models on unified chemical space
- Compare dataset cardinality, functional-group distributions, Bemis–Murcko scaffolds, and Circles metrics
- Benchmark self-supervised architectures (e.g., Mol2Vec, MolFormer) across varied dataset sizes and chemical domains
- Integrate new data sources or custom filters as requirements evolve
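For example, scaffold comparisons of the kind listed above can start from RDKit's Murcko-scaffold utilities; the counting logic below is a sketch, not part of the framework.

```python
from collections import Counter

from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_distribution(smiles_list):
    """Count Bemis-Murcko scaffolds across a dataset of SMILES strings."""
    counts = Counter()
    for smi in smiles_list:
        try:
            scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        except Exception:
            continue  # skip molecules RDKit cannot parse
        counts[scaffold] += 1
    return counts


print(scaffold_distribution(["c1ccccc1CCN", "c1ccccc1O", "CCO"]))
```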