De Novo Structure Prediction from Tandem Mass Spectra: Algorithms, Benchmarks, and Limitations
Mark Yu. Schneider, Daniil D. Kholmanskikh, Kirill Ya. Romanov, Elena A. Perekina, Sergei A. Nikolenko, Ruslan Yu. Lukin, Ivan V. Golov

TL;DR
This paper critically evaluates de novo molecular structure prediction from tandem mass spectra, finds that state-of-the-art models reach only 4.1% top-10 accuracy on leakage-controlled benchmarks, and argues for better benchmarks and methods.
Contribution
The paper introduces a rigorous benchmark and identifies data leakage issues in prior evaluations of de novo structure prediction models.
Findings
State-of-the-art models achieve only 4.1% top-10 accuracy on leakage-controlled benchmarks.
Conditioning models on molecular formulas improves exact-match accuracy significantly.
Performance of scaffold-based generation drops drastically with predicted scaffolds.
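To make the top-10 figure above concrete, here is a minimal sketch of how a top-k exact-match accuracy can be computed. It assumes each structure is represented by a canonical string identifier, so that exact match reduces to string equality. The function name `top_k_accuracy` and the toy identifiers are illustrative and do not come from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of spectra whose true structure is among the first k candidates.

    Assumption: candidates and targets are canonical identifiers, so two
    structures are the same molecule exactly when their strings are equal.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1
        for candidates, truth in zip(predictions, targets)
        if truth in candidates[:k]  # only the k highest-ranked candidates count
    )
    return hits / len(targets)


# Toy data: one ranked candidate list per spectrum (hypothetical identifiers).
preds = [["A", "B", "C"], ["D", "E"], ["F"], []]
truth = ["B", "X", "F", "G"]

top1 = top_k_accuracy(preds, truth, k=1)    # only "F" is ranked first: 1 of 4
top10 = top_k_accuracy(preds, truth, k=10)  # "B" and "F" are found: 2 of 4
```

In practice the result depends heavily on how structures are canonicalized before comparison, so the identifier scheme should be fixed before numbers from different models are compared.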
Abstract
The identification of unknown molecules from analytical data remains a fundamental challenge in chemistry, with critical implications for drug discovery, metabolomics, and natural product research. While tandem mass spectrometry provides rich structural fingerprints, most spectra are absent from reference libraries, spurring the development of de novo generative models. However, their true accuracy has been difficult to assess. Our critical analysis reveals that state-of-the-art models achieve only 4.1% top-10 accuracy on rigorously leakage-controlled benchmarks such as MassSpecGym. This sobering figure stands in stark contrast to earlier, overly optimistic reports, a discrepancy we attribute to pervasive data leakage in naive data splits. This review traces the field’s rapid evolution through three architectural eras: from fingerprint-conditioned RNN pipelines to end-to-end sequence…
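As a rough illustration of the data leakage the abstract describes, the sketch below measures how many test molecules also appear in the training set under a naive split. It checks exact identifier overlap only, which catches just the crudest form of leakage. It is not the splitting procedure of any particular benchmark, and the name `leaked_fraction` and the identifiers are hypothetical.

```python
from typing import Iterable


def leaked_fraction(train_ids: Iterable[str], test_ids: Iterable[str]) -> float:
    """Share of test molecules whose identifier also occurs in the training set.

    Assumption: each molecule has a canonical identifier, so the same
    molecule always maps to the same string in both splits.
    """
    train_set = set(train_ids)
    test_list = list(test_ids)
    if not test_list:
        return 0.0
    leaked = sum(1 for mol_id in test_list if mol_id in train_set)
    return leaked / len(test_list)


# Toy split in which two of the four test molecules were seen during training.
train = ["mol1", "mol2", "mol3"]
test = ["mol2", "mol4", "mol3", "mol5"]

fraction = leaked_fraction(train, test)  # 2 of 4 test molecules overlap
```

A value of zero here does not show that a split is leakage-free: near-duplicate structures can still inflate accuracy, and detecting them requires a structural similarity measure rather than string equality.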
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Mass Spectrometry Techniques and Applications
