MIST-CF: Chemical formula inference from tandem mass spectra
Samuel Goldman, Jiayi Xin, Joules Provenzano, and Connor W. Coley

TL;DR
MIST-CF is a neural network-based method for inferring chemical formulas from tandem mass spectra. It learns to rank candidate formula and adduct assignments without constructing fragmentation trees, and reports a 10% absolute improvement in top-1 accuracy over previous neural methods.
Contribution
The paper introduces MIST-CF, a novel energy-based neural network framework that enhances chemical formula annotation from MS/MS data without expert-parameterized fragmentation trees.
Findings
Achieves a 10% absolute improvement in top-1 accuracy over previous neural methods
Achieves near state-of-the-art performance on the CASMI2022 dataset
Circumvents the need for manual curation and post-processing
Abstract
Chemical formula annotation for tandem mass spectrometry (MS/MS) data is the first step toward structurally elucidating unknown metabolites. While great strides have been made toward solving this problem, the current state-of-the-art method depends on time-intensive, proprietary, and expert-parameterized fragmentation tree construction and scoring. In this work we extend our previous spectrum Transformer methodology into an energy based modeling framework, MIST-CF, for learning to rank chemical formula and adduct assignments given an unannotated MS/MS spectrum. Importantly, MIST-CF learns in a data dependent fashion using a Formula Transformer neural network architecture and circumvents the need for fragmentation tree construction. We train and evaluate our model on a large open-access database, showing an absolute improvement of 10% top 1 accuracy over other neural network…
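The ranking idea described in the abstract can be illustrated with a small sketch: enumerate candidate formulas whose adduct-adjusted mass matches the precursor m/z, assign each a scalar energy, and sort so the lowest energy comes first. Everything named here is an assumption made for illustration: the element set, the single [M+H]+ adduct, the 10 ppm tolerance, the search bounds, and the helper names. In particular, `toy_energy` is only a mass-error placeholder; MIST-CF itself scores candidates with a learned Formula Transformer that looks at the MS/MS peaks, which this sketch does not attempt to reproduce.

```python
from itertools import product

# Monoisotopic masses (Da) for a small, illustrative element set.
MONO_MASS = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}
# Mass shift (Da) for the [M+H]+ adduct, i.e. one proton. Other adducts omitted.
ADDUCT_SHIFT = {"[M+H]+": 1.007276467}


def formula_mass(formula):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONO_MASS[el] * n for el, n in formula.items())


def enumerate_candidates(precursor_mz, adduct="[M+H]+", ppm=10.0,
                         max_c=12, max_n=4, max_o=8):
    """List C/H/N/O formulas whose adduct m/z lies within `ppm` of the precursor."""
    target = precursor_mz - ADDUCT_SHIFT[adduct]   # neutral mass to match
    tol = precursor_mz * ppm * 1e-6                # tolerance in Da
    candidates = []
    for c, n, o in product(range(max_c + 1), range(max_n + 1), range(max_o + 1)):
        heavy = c * MONO_MASS["C"] + n * MONO_MASS["N"] + o * MONO_MASS["O"]
        # Solve for the hydrogen count that best fills the remaining mass.
        h = round((target - heavy) / MONO_MASS["H"])
        if h < 0:
            continue
        formula = {"C": c, "H": h, "N": n, "O": o}
        if abs(formula_mass(formula) - target) <= tol:
            candidates.append(formula)
    return candidates


def toy_energy(formula, precursor_mz, adduct="[M+H]+"):
    """Placeholder energy: absolute mass error in ppm (NOT the learned model)."""
    predicted_mz = formula_mass(formula) + ADDUCT_SHIFT[adduct]
    return abs(predicted_mz - precursor_mz) / precursor_mz * 1e6


def rank_candidates(precursor_mz, adduct="[M+H]+"):
    """Return candidates sorted by ascending energy (best candidate first)."""
    candidates = enumerate_candidates(precursor_mz, adduct)
    return sorted(candidates, key=lambda f: toy_energy(f, precursor_mz, adduct))


if __name__ == "__main__":
    # Hypothetical query: a precursor m/z close to protonated C6H12O6.
    for formula in rank_candidates(181.0707)[:5]:
        print(formula, round(toy_energy(formula, 181.0707), 3))
```

In a learned setting, `toy_energy` would be replaced by a model that takes the spectrum and the candidate formula and adduct as input; the enumerate-then-sort structure around it would stay the same.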
Taxonomy
Topics: Metabolomics and Mass Spectrometry Studies · Advanced Chemical Sensor Technologies · Genomics and Phylogenetic Studies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization · Label Smoothing
