TL;DR
This paper introduces an active learning approach to efficiently generate training data for universal machine learning potentials, significantly reducing data requirements while maintaining high accuracy across diverse organic molecules.
Contribution
It presents a novel active learning algorithm, based on Query by Committee, that automates dataset generation for ML potentials, improving data efficiency and the transferability of the resulting potentials.
Findings
AL reduces the amount of training data required by up to 90% compared to random sampling.
AL-based potentials match or outperform models trained on larger datasets.
The developed ANI-1x potential is accurate across diverse organic molecules.
Abstract
The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what…
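The Query by Committee step described in the abstract can be illustrated with a minimal sketch. It assumes an ensemble whose per-structure energy predictions are already available as an array. The function name `qbc_select`, the normalization of the disagreement by the square root of the atom count, and the threshold value are illustrative assumptions, not details taken from the text above.

```python
import numpy as np

def qbc_select(ensemble_energies, n_atoms, threshold):
    """Flag structures on which a committee of models disagrees.

    ensemble_energies: shape (n_models, n_structures), one predicted
        energy per model per structure.
    n_atoms: shape (n_structures,), atom count of each structure.
    threshold: hypothetical cutoff on the normalized disagreement.
    """
    energies = np.asarray(ensemble_energies, dtype=float)
    n_atoms = np.asarray(n_atoms, dtype=float)

    # Disagreement: standard deviation across the ensemble members.
    sigma = energies.std(axis=0)

    # Assumed normalization so larger molecules are not flagged
    # merely because their total energies are larger.
    rho = sigma / np.sqrt(n_atoms)

    # Structures above the cutoff are candidates for new reference
    # calculations and addition to the training set.
    return np.flatnonzero(rho > threshold), rho

# Toy data: three models, three structures (arbitrary energy units).
preds = np.array([
    [-10.0, -20.0, -30.0],
    [-10.1, -20.0, -33.0],
    [-9.9, -20.0, -27.0],
])
selected, rho = qbc_select(preds, n_atoms=[4, 9, 9], threshold=0.5)
print(selected)
```

In this toy case only the third structure is selected, because it is the only one on which the committee members disagree strongly. In an active learning loop, the selected structures would be labeled with reference calculations, added to the training set, and the ensemble retrained before the next round of sampling.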
