2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision
Yunrui Li, Hao Xu, Pengyu Hong

TL;DR
This paper introduces 2DNMRGym, a large annotated dataset for machine-learning-based molecular interpretation of 2D NMR spectra, which uses surrogate supervision to enable model training and evaluation.
Contribution
It provides the first large-scale annotated 2D NMR dataset with surrogate supervision, facilitating development of ML models for molecular representation learning in NMR analysis.
Findings
Reports benchmark results for GNN and transformer models on 2DNMRGym.
Demonstrates the effectiveness of surrogate supervision for interpreting NMR data.
Establishes a foundation for future ML research in NMR-based molecular analysis.
Abstract
Two-dimensional (2D) Nuclear Magnetic Resonance (NMR) spectroscopy, particularly Heteronuclear Single Quantum Coherence (HSQC) spectroscopy, plays a critical role in elucidating molecular structures, interactions, and electronic properties. However, accurately interpreting 2D NMR data remains labor-intensive and error-prone, requiring highly trained domain experts, especially for complex molecules. Machine Learning (ML) holds significant potential in 2D NMR analysis by learning molecular representations and recognizing complex patterns from data. However, progress has been limited by the lack of large-scale and high-quality annotated datasets. In this work, we introduce 2DNMRGym, the first annotated experimental dataset designed for ML-based molecular representation learning in 2D NMR. It includes over 22,000 HSQC spectra, along with the corresponding molecular graphs and SMILES…
Taxonomy
Topics: Machine Learning in Materials Science · Nuclear Physics and Applications · Hydrocarbon exploration and reservoir analysis
Methods: Sparse Evolutionary Training
