Bayesian Selection for Efficient MLIP Dataset Selection

Thomas Rocke; James Kermode

arXiv:2502.21165·cond-mat.mtrl-sci·June 23, 2025

TL;DR

This paper introduces a Bayesian selection method for constructing efficient training datasets for machine learning interatomic potentials (MLIPs). On silicon surface energy prediction, the method clearly outperforms simple random sampling and is competitive with existing selection methods.

Contribution

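The paper presents a Bayesian selection approach for choosing training structures from a candidate set during MLIP dataset construction. In the silicon surface energy task it gives lower errors than simple random sampling and is competitive with existing selection methods.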

Findings

01

Bayesian selection outperforms random sampling in silicon surface energy tasks.

02

In the low data regime, the method's error on the (100) surface energy is 4.3x lower than that of simple random sampling.

03

It is competitive with existing selection techniques using ACE and MACE features.

Abstract

The problem of constructing a dataset for MLIP development which gives the maximum quality in the minimum amount of compute time is complex, and can be approached in a number of ways. We introduce a "Bayesian selection" approach for selecting from a candidate set of structures, and compare the effectiveness of this method against other common approaches in the task of constructing ideal datasets targeting silicon surface energies. We show that the Bayesian selection method performs much better than simple random sampling at this task (for example, the error on the (100) surface energy is 4.3x lower in the low data regime), and is competitive with a variety of existing selection methods, using ACE and MACE features.
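To make the idea of selecting from a candidate set more concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not state the selection criterion, so the sketch assumes a generic one, namely greedily picking the candidate with the largest predictive variance under a Bayesian linear model on per-structure feature vectors. The function names, the `prior_precision` and `noise_var` parameters, and the random toy features (standing in for ACE or MACE descriptors) are all hypothetical.

```python
import numpy as np


def greedy_variance_selection(features, n_select, prior_precision=1.0, noise_var=1.0):
    """Greedily pick rows of `features` with the largest predictive variance.

    Assumed model (not taken from the paper): Bayesian linear regression with an
    isotropic Gaussian prior on the weights and Gaussian observation noise.
    """
    n_candidates, n_features = features.shape
    # Weight covariance before any structure is selected: the prior alone.
    cov = np.eye(n_features) / prior_precision
    remaining = np.ones(n_candidates, dtype=bool)
    selected = []
    for _ in range(min(n_select, n_candidates)):
        # Predictive variance x^T cov x for every candidate at once.
        variances = np.einsum("ij,jk,ik->i", features, cov, features)
        variances[~remaining] = -np.inf  # never re-select a structure
        best = int(np.argmax(variances))
        selected.append(best)
        remaining[best] = False
        # Sherman-Morrison rank-one update of the covariance after adding `best`.
        x = features[best]
        cov_x = cov @ x
        cov = cov - np.outer(cov_x, cov_x) / (noise_var + x @ cov_x)
    return selected


def simple_random_sampling(n_candidates, n_select, seed=0):
    """Baseline: draw structures uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_candidates, size=n_select, replace=False).tolist()


# Toy stand-in for descriptor features: 200 candidate structures, 8 features each.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(200, 8))

bayes_idx = greedy_variance_selection(toy_features, n_select=10)
random_idx = simple_random_sampling(len(toy_features), n_select=10)
```

In this sketch the variance criterion depends only on the features, not on any labels, so the selection can be made before running expensive reference calculations on the chosen structures. Whether the paper's method shares that property is not stated in the abstract.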
