Thursday 11:05 in room 1.19 (ground floor)

Efficient processing pipelines for large scale molecular datasets in Python

Franciszek Job

Problem Statement

Public molecular collections are fragmented across multiple repositories, e.g. PubChem, UniChem, or COCONUT. Each has a distinct data representation, inclusion criteria, and other idiosyncrasies. This makes it difficult to assemble a single, high-quality dataset, e.g. for pretraining molecular foundation models aimed at varied chemoinformatics applications.

Software

Our framework accepts one or more raw chemical datasets as input; each is delegated to its respective pipeline (described below). The output is a single, cleaned master dataset.
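A minimal sketch of this delegation, assuming each source registers a pipeline as a list of record-transforming steps (the names `SourcePipeline` and `build_master` are hypothetical, not the framework's actual API):

```python
from dataclasses import dataclass
from typing import Callable

# A step maps a list of records to a list of records.
Step = Callable[[list[dict]], list[dict]]

@dataclass
class SourcePipeline:
    """One per-source pipeline: a named sequence of steps."""
    name: str
    steps: list[Step]

    def run(self, records: list[dict]) -> list[dict]:
        for step in self.steps:
            records = step(records)
        return records

def build_master(pipelines: list[SourcePipeline],
                 raw_inputs: dict[str, list[dict]]) -> list[dict]:
    """Run each source through its own pipeline, then merge the outputs."""
    master: list[dict] = []
    for pipeline in pipelines:
        master.extend(pipeline.run(raw_inputs[pipeline.name]))
    return master
```

Each source keeps its idiosyncratic parsing inside its own step list, while the merge stays generic.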

Pipeline Overview

The pipeline consumes raw data from each source and executes a sequence of steps:

  1. Download: fetch original files
  2. Preprocess: parse and normalize input formats
  3. Standardize: sanitize and canonicalize SMILES via RDKit
  4. Deduplicate: remove duplicates using InChI keys (per source, then globally)

The final output is a merged “master” dataset ready for downstream analysis or modeling.
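Steps 3 and 4 can be sketched with RDKit (assuming RDKit is installed; the function names here are illustrative, not the project's actual API):

```python
from rdkit import Chem

def standardize(smiles: str):
    """Sanitize and canonicalize a SMILES string; return None if unparseable."""
    mol = Chem.MolFromSmiles(smiles)  # sanitizes by default
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical SMILES

def deduplicate(smiles_list):
    """Keep one record per InChIKey, preserving first-seen order."""
    seen, unique = set(), []
    for smi in smiles_list:
        std = standardize(smi)
        if std is None:
            continue  # drop records that fail sanitization
        key = Chem.MolToInchiKey(Chem.MolFromSmiles(std))
        if key not in seen:
            seen.add(key)
            unique.append(std)
    return unique
```

Two differently written SMILES for the same molecule (e.g. "CCO" and "OCC" for ethanol) canonicalize to one string and share an InChIKey, so only one copy survives deduplication.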

Technologies & Tools

Intended Uses

This software enables users to:

Franciszek Job

Software engineer at Software Mansion
Computer Science engineering undergraduate (3rd year)
Mainly used technologies: Python, Rust