From Human Labels to Literature: Semi-Supervised Learning of NMR Chemical Shifts at Scale
Yongqi Jin, Yecheng Wang, Jun-jie Wang, Rong Zhu, Guolin Ke, Weinan E

TL;DR
This paper introduces a semi-supervised learning framework that uses millions of literature-extracted NMR spectra lacking atom-level assignments to improve the accuracy and robustness of chemical shift prediction and to capture solvent effects at scale.
Contribution
It presents a novel permutation-invariant set supervision approach with a sorting-based loss. This enables large-scale semi-supervised training on literature data, and the resulting models surpass state-of-the-art methods.
Findings
Achieves higher accuracy and robustness than existing models.
Demonstrates effective generalization on diverse molecular datasets.
Captures systematic solvent effects across common NMR solvents.
Abstract
Accurate prediction of nuclear magnetic resonance (NMR) chemical shifts is fundamental to spectral analysis and molecular structure elucidation, yet existing machine learning methods rely on limited, labor-intensive atom-assigned datasets. We propose a semi-supervised framework that learns NMR chemical shifts from millions of literature-extracted spectra without explicit atom-level assignments, integrating a small amount of labeled data with large-scale unassigned spectra. We formulate chemical shift prediction from literature spectra as a permutation-invariant set supervision problem, and show that under commonly satisfied conditions on the loss function, optimal bipartite matching reduces to a sorting-based loss, enabling stable large-scale semi-supervised training beyond traditional curated datasets. Our models achieve substantially improved accuracy and robustness over…
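The reduction from optimal bipartite matching to sorting can be illustrated with a small sketch. It is not the authors' implementation. It assumes a squared-error loss (convexity in the difference is one well-known sufficient condition for the sorted pairing to be optimal; the paper's exact conditions are not given in this summary) and assumes that the predicted and observed shift sets have the same size. The function names and the example shift values are hypothetical.

```python
import itertools

import numpy as np


def sorted_set_loss(pred, observed):
    """MSE between sorted predictions and sorted observed shifts.

    Assumption: both sets contain the same number of shifts.
    """
    pred = np.sort(np.asarray(pred, dtype=float))
    observed = np.sort(np.asarray(observed, dtype=float))
    return float(np.mean((pred - observed) ** 2))


def brute_force_matching_loss(pred, observed):
    """Minimum MSE over all one-to-one assignments.

    Runs in O(n!) time, so it is only a reference check for tiny sets.
    """
    pred = np.asarray(pred, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return min(
        float(np.mean((pred - observed[list(perm)]) ** 2))
        for perm in itertools.permutations(range(len(observed)))
    )


# Hypothetical 13C shifts in ppm: per-atom predictions and an
# unassigned peak list as it might be reported in a paper.
pred = [128.4, 21.0, 170.9, 60.2]
observed = [171.1, 20.8, 60.5, 128.0]

loss_sorted = sorted_set_loss(pred, observed)
loss_matched = brute_force_matching_loss(pred, observed)
```

For this example the two losses agree up to floating-point rounding, so the sort (O(n log n)) stands in for an explicit assignment solver. Real literature spectra may report overlapping or missing peaks, which this sketch does not handle.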
Taxonomy
Topics: Molecular spectroscopy and chirality · Computational Drug Discovery Methods · Machine Learning in Materials Science
