Dataset Distillation for Machine Learning Force Field in Phase Transition Regime
Ruiyang Chen, Qingyuan Zhang, and Ji Chen

TL;DR
This paper introduces a dataset distillation method called Central-Peripheral Distillation (CPD) to efficiently train machine learning force fields for phase transition regimes, demonstrated on dense hydrogen.
Contribution
The novel CPD algorithm strategically integrates representative and critical samples to retain structural diversity in distilled datasets for MLFF training.
Findings
Only 200 configurations suffice to train an accurate MLFF for liquid hydrogen near the phase transition.
CPD outperforms traditional methods in capturing structural and dynamical properties.
The approach enables high-fidelity dataset labeling using advanced ab initio calculations.
Abstract
Machine learning force fields (MLFFs) have emerged as a powerful data-driven tool for atomistic simulations, enabling large-scale and complex atomic systems to be simulated with accuracy comparable to ab initio methods. However, MLFFs often suffer from low training efficiency in the phase transition regime, where structural fluctuations are significantly elevated. To address this challenge, we propose a Central-Peripheral Distillation (CPD) algorithm for training dataset distillation. By strategically integrating representative samples with critical corner cases, the CPD algorithm ensures that the distilled dataset retains maximum structural diversity. We validated the efficacy of the CPD method on the liquid-liquid phase transition of dense hydrogen. Results show that, with the CPD approach, only 200 configurations are sufficient to train an MLFF that can fully reproduce the…
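The abstract describes CPD only at a high level: representative ("central") samples are combined with critical corner cases ("peripheral" samples). The sketch below is one possible reading of that idea, not the authors' implementation. It assumes each configuration has already been mapped to a fixed-length descriptor vector, takes the point nearest each k-means centroid as a central sample, and takes the points farthest from their own centroid as peripheral samples. The function names, the descriptor matrix `X`, the use of k-means, and the sample counts in the demo are all hypothetical choices made for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on rows of X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment consistent with the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def central_peripheral_select(X, n_central, n_peripheral, seed=0):
    """Hypothetical central/peripheral selection (assumption, not the paper's rule).

    central:    for each cluster, the index of the point nearest its centroid
    peripheral: indices of the points farthest from their own centroid,
                excluding points already chosen as central
    """
    centroids, labels = kmeans(X, n_central, seed=seed)
    dist_to_own = np.linalg.norm(X - centroids[labels], axis=1)

    central = []
    for j in range(n_central):
        members = np.flatnonzero(labels == j)
        if len(members) > 0:
            central.append(members[dist_to_own[members].argmin()])
    central = np.array(central, dtype=int)

    is_central = np.zeros(len(X), dtype=bool)
    is_central[central] = True
    order = np.argsort(-dist_to_own)  # farthest from own centroid first
    peripheral = np.array(
        [i for i in order if not is_central[i]][:n_peripheral], dtype=int
    )
    return central, peripheral


# Demo on synthetic descriptors (random numbers, not hydrogen data).
X_demo = np.random.default_rng(1).normal(size=(300, 4))
central_idx, peripheral_idx = central_peripheral_select(X_demo, 10, 5)
distilled_idx = np.concatenate([central_idx, peripheral_idx])
print(len(central_idx), len(peripheral_idx), len(distilled_idx))
```

In a real workflow the selected indices would identify the configurations to label with the expensive ab initio method; how the paper balances the two groups to reach its 200-configuration dataset is not stated in this summary.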
