MADGEN: Mass-Spec attends to De Novo Molecular generation
Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun

TL;DR
MADGEN is a novel scaffold-based method that uses mass spectrometry data to guide de novo molecular structure generation, significantly improving annotation accuracy in complex biological samples.
Contribution
MADGEN introduces a two-stage approach: contrastive learning retrieves a candidate scaffold for a given MS/MS spectrum, and an attention-based generative model, guided by the spectrum, then builds the full molecule from that scaffold.
Findings
MADGEN outperforms existing methods on multiple datasets.
Attention integration improves molecular generation accuracy.
Using an oracle retriever yields the best results.
Abstract
The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation starting with the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved…
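The first stage described above (ranking candidate scaffolds against a spectrum using contrastively aligned embeddings) can be illustrated with a minimal sketch. It assumes the spectrum and each scaffold have already been encoded as fixed-length vectors. The toy embeddings, the function names `cosine`, `info_nce`, and `rank_scaffolds`, and the temperature value are all hypothetical and are not taken from the paper. The loss shown is a generic InfoNCE objective, which may differ from the one MADGEN uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(spec_emb, scaffold_embs, positive_idx, temperature=0.1):
    """Generic InfoNCE loss for one spectrum against candidate scaffolds.

    The loss is low when the true scaffold (positive_idx) is the most
    similar candidate and high otherwise. The temperature is an assumed value.
    """
    logits = [cosine(spec_emb, s) / temperature for s in scaffold_embs]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_idx]

def rank_scaffolds(spec_emb, scaffold_embs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine(spec_emb, s) for s in scaffold_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings (hypothetical): candidate 1 is closest to the spectrum.
spectrum = [0.9, 0.1, 0.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0], [0.0, 0.0, 1.0]]

ranking = rank_scaffolds(spectrum, candidates)
loss_good = info_nce(spectrum, candidates, positive_idx=1)
loss_bad = info_nce(spectrum, candidates, positive_idx=2)
```

At inference time only the ranking step is needed. The loss is used during training to pull each spectrum toward its true scaffold and away from the other candidates.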
Peer Reviews
Decision·ICLR 2025 Poster
* The two-stage idea is interesting.
* The oracle retrieval method is more effective.

* The SPA of the predictive retrieval is very low.
* The predictive retrieval approach yields poor molecule generation in Phase 2, where the generated structures fail to align with target properties, underscoring a critical limitation.
* The conditioning of molecular generation on mass spectrometry data is largely based on classifier-free guidance, a well-established technique. The novelty is not well articulated.

- The use of scaffolds for simplifying molecular generation is a novel and effective strategy that reduces complexity.
- The paper is clear and easy to understand.

- The method lacks a comparison with other baselines.

1. The two-stage approach of scaffold retrieval followed by scaffold-conditioned molecular generation presents a novel solution for de novo molecular structure prediction.
2. The paper is well-written and easy to follow.
3. The model is evaluated on multiple datasets, and a detailed ablation study is provided.

1. The scaffold retrieval performance, especially when using a predictive retriever, remains relatively low (e.g., NIST23).
2. The discussion of baselines is unclear.
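
One review above notes that the spectral conditioning relies on classifier-free guidance. For readers unfamiliar with the technique, the following is a generic sketch of how a guided prediction is commonly formed from a conditional and an unconditional model output. The function name, the toy values, and this particular parameterization of the guidance scale are assumptions for illustration. It is not the paper's implementation.

```python
def classifier_free_guidance(cond, uncond, scale):
    """Blend conditional and unconditional predictions elementwise.

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 extrapolates further toward the condition
    (here, the condition would be the mass spectrum).
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Toy model outputs (hypothetical values).
cond_pred = [2.0, 0.0]    # prediction given the spectrum
uncond_pred = [1.0, 1.0]  # prediction with the spectrum dropped

guided = classifier_free_guidance(cond_pred, uncond_pred, scale=2.0)
```

During training, the condition is typically dropped at random so that a single network can produce both predictions.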
Taxonomy
Topics: Electron and X-Ray Spectroscopy Techniques · X-ray Spectroscopy and Fluorescence Analysis · Genetics, Bioinformatics, and Biomedical Research
Methods: Softmax · Attention Is All You Need · Contrastive Learning
