MassSpecGym: A benchmark for the discovery and identification of molecules
Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S. Wishart

TL;DR
MassSpecGym introduces a comprehensive benchmark dataset and evaluation framework for molecular discovery and identification from MS/MS spectra, aiming to standardize and advance machine learning methods in this challenging field.
Contribution
It provides the first large-scale, labeled MS/MS dataset with standardized tasks and metrics, facilitating progress in molecular structure prediction from spectra.
Findings
Largest publicly available MS/MS dataset for benchmarking.
Defines three key MS/MS annotation challenges.
Introduces new evaluation metrics and data splits.
Abstract
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification…
