# De Novo Structure Prediction from Tandem Mass Spectra: Algorithms, Benchmarks, and Limitations

**Authors:** Mark Yu. Schneider, Daniil D. Kholmanskikh, Kirill Ya. Romanov, Elena A. Perekina, Sergei A. Nikolenko, Ruslan Yu. Lukin, Ivan V. Golov

PMC · DOI: 10.3390/molecules31050769 · 2026-02-25

## TL;DR

This review evaluates how accurately de novo generative models predict molecular structures from tandem mass spectra, and argues for standardized, leakage-aware benchmarks and transparent reporting.

## Contribution

The review re-assesses de novo structure prediction models on rigorously leakage-controlled benchmarks such as MassSpecGym, and attributes earlier, more optimistic results to data leakage in naive data splits.
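The leakage problem can be illustrated with a small sketch. It assumes that each record carries a hypothetical `structure` key identifying its molecule, and it splits by group so that no structure appears on both sides. This is a simplified illustration only, not the splitting procedure used by MassSpecGym or by the authors; the names `group_split` and `leaked_keys` are invented for this example.

```python
import random
from collections import defaultdict


def group_split(records, key, test_fraction=0.2, seed=0):
    """Split records so that all records sharing key(record) stay together.

    `key` is a caller-supplied function (an assumption of this sketch),
    e.g. one returning a structure identifier for each spectrum record.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    # Shuffle the group keys, not the individual records.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(test_fraction * len(keys)))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


def leaked_keys(train, test, key):
    """Return the set of group keys that occur in both train and test."""
    return {key(r) for r in train} & {key(r) for r in test}


# Toy data: nine spectra measured from only three distinct structures.
records = [{"spectrum_id": i, "structure": f"mol{i % 3}"} for i in range(9)]
by_structure = lambda r: r["structure"]

# A naive split by record order puts every structure on both sides.
naive_train, naive_test = records[:6], records[6:]
naive_leak = leaked_keys(naive_train, naive_test, by_structure)

# The grouped split keeps each structure on one side only.
train, test = group_split(records, by_structure, test_fraction=0.34)
grouped_leak = leaked_keys(train, test, by_structure)
```

In the toy data, the naive split leaks all three structures into the test set, while the grouped split leaks none. Exact-structure grouping is the weakest form of leakage control; near-duplicate structures would still need a similarity-based criterion, which this sketch does not attempt.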

## Key findings

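- State-of-the-art models achieve only 4.1% top-10 accuracy on rigorously leakage-controlled benchmarks such as MassSpecGym.
- Explicitly conditioning generative models on the molecular formula significantly improves exact-match accuracy over unconstrained baselines.
- Scaffold-based generation shows high potential with oracle scaffolds, but its performance drops drastically when the scaffolds are predicted.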
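For readers unfamiliar with the metric, here is a minimal sketch of top-k exact-match accuracy. It assumes each prediction is a ranked list of structure identifier strings and that structures were canonicalized upstream, so string equality stands in for structural identity. The function name `top_k_accuracy` and the toy identifiers are hypothetical, and the source does not describe the exact evaluation code.

```python
def top_k_accuracy(predictions, targets, k=10):
    """Fraction of targets found among the first k ranked predictions.

    predictions: list of ranked candidate lists, best candidate first.
    targets: list of true structure identifiers, one per spectrum.
    Identifiers are assumed to be canonical strings (not checked here).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0

    hits = sum(
        1 for ranked, truth in zip(predictions, targets) if truth in ranked[:k]
    )
    return hits / len(targets)


# Toy example with placeholder identifiers instead of real structures.
predictions = [["A", "B"], ["C"], ["D", "E", "F"]]
targets = ["B", "X", "F"]

acc_at_2 = top_k_accuracy(predictions, targets, k=2)  # only the first spectrum hits
acc_at_3 = top_k_accuracy(predictions, targets, k=3)  # first and third spectra hit
```

Under this definition, a top-10 accuracy of 4.1% means the correct structure appears among the ten highest-ranked candidates for roughly 4 spectra in every 100.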

## Abstract

The identification of unknown molecules from analytical data remains a fundamental challenge in chemistry, with critical implications for drug discovery, metabolomics, and natural product research. While tandem mass spectrometry provides rich structural fingerprints, most spectra are absent from reference libraries, spurring the development of de novo generative models. However, their true accuracy has been difficult to assess. Our critical analysis reveals that state-of-the-art models achieve only 4.1% top-10 accuracy on rigorously leakage-controlled benchmarks like MassSpecGym. This sobering figure stands in stark contrast to earlier, overly optimistic reports, a discrepancy we attribute to pervasive data leakage in naive data splits. This review traces the field’s rapid evolution through three architectural eras: from fingerprint-conditioned RNN pipelines to end-to-end sequence models and, most recently, to graph-native diffusion under molecular-formula constraints. We demonstrate that explicitly conditioning generative models on a molecular formula significantly improves exact-match accuracy compared to unconstrained baselines. Crucially, our analysis distinguishes between two experimentally relevant paradigms: formula-conditioned generation for true unknown discovery and scaffold-based generation for hypothesis-driven research. While the latter shows high potential with oracle scaffolds, its performance drastically drops with predicted ones, revealing a critical bottleneck. To build the next generation of reliable tools, we propose a clear roadmap centered on standardized, leakage-aware benchmarking and transparent reporting.
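To make the molecular-formula constraint concrete, the sketch below filters candidate structures down to those whose elemental composition matches a target formula. This is a post-hoc filter for illustration, not the formula-conditioned generation the abstract refers to, where the constraint is built into the model. It assumes simple formulas without parentheses, charges, or isotope labels, and the names `parse_formula` and `filter_by_formula` are hypothetical.

```python
import re
from collections import Counter

# One element symbol (e.g. "C", "Cl") followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Parse a simple formula such as 'C6H12O6' into element counts.

    Raises ValueError on anything the simple grammar does not cover
    (parentheses, charges, isotopes, stray characters).
    """
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:  # unparsed characters between tokens
            raise ValueError(f"cannot parse formula: {formula!r}")
        counts[match.group(1)] += int(match.group(2) or 1)
        pos = match.end()
    if not formula or pos != len(formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    return counts


def filter_by_formula(candidates, target_formula):
    """Keep the names of candidates whose composition equals the target.

    candidates: iterable of (name, formula) pairs (an assumed input shape).
    """
    target = parse_formula(target_formula)
    return [name for name, f in candidates if parse_formula(f) == target]


candidates = [
    ("ethanol", "C2H6O"),
    ("dimethyl ether", "CH3OCH3"),  # same composition, different structure
    ("methanol", "CH4O"),
]
kept = filter_by_formula(candidates, "C2H6O")
```

Ethanol and dimethyl ether both survive the filter, which shows what the constraint can and cannot do: it narrows the candidate space to the correct composition, but isomers still have to be distinguished from the spectrum itself.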

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12985711/full.md

---
Source: https://tomesphere.com/paper/PMC12985711