Efficient and accurate models for peptide function prediction

Thursday 10:30 in room 1.19 (ground floor)

Piotr Ludynia

Peptides, as small proteins, play crucial roles in biological processes and offer immense therapeutic potential in areas such as antimicrobial resistance, cancer treatment, and antiviral therapies. While deep learning methods like graph neural networks (GNNs) and protein language models (PLMs) have been widely explored for peptide function prediction, they often face scalability challenges and require significant computational resources.

We present methods and results from our paper (https://arxiv.org/abs/2501.17901), which introduces an alternative approach based on molecular fingerprints (well-established chemoinformatics techniques used primarily with smaller molecules) to predict peptide properties efficiently and accurately. Our research demonstrates that count-based variants of hashed molecular fingerprints, when paired with tree-based classifiers like LightGBM, outperform deep learning approaches such as GNNs and PLMs. We validate our approach across six benchmarks and 126 datasets, achieving state-of-the-art results in peptide function prediction. Our findings challenge the assumed necessity of modeling long-range dependencies in peptides, showing that short-range molecular substructures capture enough information for accurate function prediction.
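To make the difference between binary and count-based hashed fingerprints concrete, here is a minimal, standard-library-only sketch. It is a toy analogue and not the method from the paper: it hashes short substrings of the amino acid sequence instead of substructures of the molecular graph. The function name `hashed_fingerprint`, the vector length of 64, the maximum fragment length of 3, and the example sequence are all assumptions chosen for illustration.

```python
import zlib


def hashed_fingerprint(sequence, n_bits=64, max_len=3, count=True):
    """Toy hashed fingerprint over short substrings of a peptide sequence.

    Assumption: substrings of length 1..max_len stand in for the local
    molecular substructures that a real fingerprint would enumerate.
    """
    vec = [0] * n_bits
    for k in range(1, max_len + 1):
        for i in range(len(sequence) - k + 1):
            fragment = sequence[i:i + k]
            # crc32 is deterministic across runs, unlike the built-in hash()
            idx = zlib.crc32(fragment.encode("ascii")) % n_bits
            if count:
                vec[idx] += 1  # count variant: how often a fragment occurs
            else:
                vec[idx] = 1   # binary variant: only whether it occurs
    return vec


peptide = "GLFDIIKKIAESF"  # arbitrary example sequence
binary = hashed_fingerprint(peptide, count=False)
counts = hashed_fingerprint(peptide, count=True)
```

The count variant keeps how many times each short fragment appears, information the binary variant discards. Vectors of this kind can be passed directly to a tree-based classifier.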

Additionally, we will present performance optimizations that enhance computational efficiency, including parallel implementation and sparse representations. Our work is encapsulated in an open-source Python library, scikit-fingerprints, providing a practical tool for researchers in machine learning and computational chemistry.
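The two optimization ideas can be sketched with the standard library alone. This is an illustration under stated assumptions, not the scikit-fingerprints implementation: `sparse_count_fingerprint` and `featurize` are hypothetical names, the toy fingerprint hashes sequence substrings instead of molecular substructures, and a thread pool is used only to show the split-and-collect structure (real speedups for CPU-bound work would need processes or native code).

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

N_BITS = 2048  # assumed fingerprint length for this sketch


def sparse_count_fingerprint(sequence, max_len=3):
    """Return only the non-zero entries of a toy count fingerprint
    as a {index: count} mapping."""
    counts = {}
    for k in range(1, max_len + 1):
        for i in range(len(sequence) - k + 1):
            fragment = sequence[i:i + k]
            idx = zlib.crc32(fragment.encode("ascii")) % N_BITS
            counts[idx] = counts.get(idx, 0) + 1
    return counts


def featurize(sequences, n_workers=4):
    """Compute fingerprints for many sequences concurrently.

    Executor.map returns results in input order, so row i of the
    output always corresponds to sequences[i].
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(sparse_count_fingerprint, sequences))


peptides = ["GLFDIIKKIAESF", "KWKLFKKIGAVLKVL", "ACDEFGHIK"]  # arbitrary examples
sparse_rows = featurize(peptides)

stored = sum(len(row) for row in sparse_rows)  # entries actually kept
dense_size = len(peptides) * N_BITS            # entries a dense matrix would hold
```

Because each peptide activates only a small number of positions, `stored` is far smaller than `dense_size`, which is what makes sparse storage worthwhile for long fingerprints.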

This presentation will offer insights into the broader applications of peptide-based drug discovery and highlight the value of combining molecular fingerprints with scalable machine learning frameworks. Attendees will gain an understanding of current chemoinformatics research on peptides and become familiar with graph vectorization methods. They will see how combining domain-specific feature extraction with tree ensembles can yield better results than more complex models, at a fraction of the computational cost.

Piotr Ludynia

I am a data science and computer science student at AGH University of Kraków. My primary interests include machine learning and chemoinformatics.