A Bin and Hash Method for Analyzing Reference Data and Descriptors in Machine Learning Potentials
Martín Leandro Paleico, Jörg Behler

TL;DR
This paper introduces the bin-and-hash (BAH) algorithm, a novel method to efficiently analyze and compare large multidimensional datasets in machine learning potentials, improving data handling and quality assessment.
Contribution
The BAH algorithm provides a general, efficient approach for identifying and comparing large sets of vectors in ML potentials, aiding in data reduction and quality control.
Findings
Enables efficient comparison of large multidimensional vectors
Reduces redundancy in reference datasets
Improves assessment of descriptor quality
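The summary above names the bin-and-hash idea without spelling out its steps. The following is a minimal sketch, not the authors' implementation. It assumes that "bin" means discretizing each vector component onto a grid of width `bin_width`, and that "hash" means condensing the resulting tuple of bin indices into a single key. The function names `bin_and_hash` and `redundant_groups`, the `bin_width` parameter, the choice of SHA-1, and the sample vectors are all hypothetical.

```python
import hashlib
import math
from collections import defaultdict


def bin_and_hash(vectors, bin_width):
    """Group vector indices by the hash of their binned components.

    Assumption: a uniform grid of width `bin_width` in every dimension.
    """
    groups = defaultdict(list)
    for index, vector in enumerate(vectors):
        # Binning: map every component to an integer bin index.
        bins = tuple(math.floor(x / bin_width) for x in vector)
        # Hashing: condense the tuple of bin indices into one short key.
        key = hashlib.sha1(repr(bins).encode("utf-8")).hexdigest()
        groups[key].append(index)
    return dict(groups)


def redundant_groups(groups):
    """Return the groups that hold more than one vector."""
    return [members for members in groups.values() if len(members) > 1]


# Illustrative data only: two nearly identical vectors and one distinct vector.
vectors = [
    [0.11, 0.52, 0.93],
    [0.12, 0.53, 0.94],
    [0.71, 0.22, 0.35],
]
groups = bin_and_hash(vectors, bin_width=0.1)
duplicates = redundant_groups(groups)
print(duplicates)
```

In this sketch, vectors that share a key fall into the same grid cell, so comparing a large set reduces to a dictionary lookup per vector rather than a pairwise distance calculation. One known limitation of any fixed grid is that two close vectors on opposite sides of a bin boundary receive different keys.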
Abstract
In recent years, the development of machine learning (ML) potentials (MLPs) has become a very active field of research. Numerous approaches have been proposed that allow extended simulations of large systems to be performed at a small fraction of the computational cost of electronic structure calculations. The key to the success of modern ML potentials is their close to first-principles quality in describing the atomic interactions. This accuracy is reached by using very flexible functional forms in combination with high-level reference data from electronic structure calculations. These data sets can include up to hundreds of thousands of structures covering millions of atomic environments to ensure that all relevant features of the potential energy surface are well represented. The handling of such large data sets is nowadays becoming one of the main challenges in the construction of ML…
