Role of Structural and Conformational Diversity for Machine Learning Potentials
Nikhil Shenoy, Prudencio Tossou, Emmanuel Noutahi, Hadrien Mary, Dominique Beaini, Jiarui Ding

TL;DR
This paper explores how structural and conformational diversity in datasets affects the generalization of machine learning interatomic potentials, highlighting the importance of balanced data and the limitations of current models.
Contribution
It provides a detailed analysis of the impact of data diversity on MLIP performance and offers guidelines for optimizing quantum mechanics data generation.
Findings
Balanced structural and conformational diversity improves model generalization.
Existing QM datasets often lack the optimal diversity trade-off.
MLIP models struggle to generalize beyond their training distribution.
Abstract
In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed budget one, where the dataset size remains constant, and a fixed molecular set one, which focuses on fixed structural diversity while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitation of the MLIP models at generalizing beyond their…
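The two experimental designs named in the abstract can be illustrated with a small sketch. It is not taken from the paper: the function names, the budget, and the conformer counts below are hypothetical, and it assumes the budget is counted as the total number of conformers (molecules × conformers per molecule).

```python
def fixed_budget_splits(budget, conformers_per_molecule_options):
    """Fixed-budget design (assumed form): total dataset size is held
    constant, so more conformers per molecule means fewer molecules."""
    splits = []
    for n_conf in conformers_per_molecule_options:
        n_mol = budget // n_conf  # molecules that fit in the budget
        if n_mol > 0:
            splits.append((n_mol, n_conf))
    return splits


def fixed_molecule_splits(n_molecules, conformers_per_molecule_options):
    """Fixed-molecular-set design (assumed form): the set of molecules is
    held constant, so dataset size grows with conformers per molecule."""
    return [(n_molecules, n_conf, n_molecules * n_conf)
            for n_conf in conformers_per_molecule_options]


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    for n_mol, n_conf in fixed_budget_splits(1000, [1, 10, 100]):
        print(f"fixed budget: {n_mol} molecules x {n_conf} conformers")
    for n_mol, n_conf, total in fixed_molecule_splits(50, [1, 4]):
        print(f"fixed set: {n_mol} molecules x {n_conf} conformers = {total}")
```

Under a fixed budget, structural diversity (number of molecules) and conformational diversity (conformers per molecule) trade off directly; with a fixed molecular set, only conformational diversity varies.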
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Protein Structure and Dynamics
Methods: Sparse Evolutionary Training
