Enhanced Representation-Based Sampling for the Efficient Generation of Data Sets for Machine-Learned Interatomic Potentials
Moritz R. Schäfer, Johannes Kästner

TL;DR
This paper introduces ERBS, a new method to efficiently generate diverse training data for machine-learned interatomic potentials, improving model accuracy with less data.
Contribution
ERBS combines dimensionality reduction and bias potentials to automatically sample diverse atomic configurations for training machine-learned potentials.
Findings
ERBS reconstructs accurate free energy surfaces using short biased trajectories.
ERBS-generated data for liquid water matches reference models with fewer data points.
ERBS outperforms uncertainty-driven dynamics in exploring configurational space.
Abstract
In this work, we present enhanced representation-based sampling (ERBS), a novel enhanced sampling method designed to generate structurally diverse training data sets for machine-learned interatomic potentials. ERBS automatically identifies collective variables by dimensionality reduction of atomic descriptors and applies a bias potential inspired by the On-the-Fly probability enhanced sampling framework. We highlight the ability of Gaussian moment descriptors to capture collective molecular motions and explore the impact of biasing parameters using alanine dipeptide as a benchmark system. We show that free energy surfaces can be reconstructed with high fidelity using only short biased trajectories as training data. Further, we apply the method to the iterative construction of a liquid water data set and compare the quality of simulated self-diffusion coefficients for models trained with…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
1
2
3
4
5
6
7- —Deutsche Forschungsgemeinschaft10.13039/501100001659
- —Deutsche Forschungsgemeinschaft10.13039/501100001659
- —Deutsche Forschungsgemeinschaft10.13039/501100001659
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMachine Learning in Materials Science · Protein Structure and Dynamics · Advanced Physical and Chemical Molecular Interactions
Introduction
1
Machine-learned interatomic potentials (MLIPs) have proven themselves to be a suitable approach to studying the dynamics of atomistic systems over the last couple of years. ?,? By training a machine learning model on the results of ab initio calculations, it is possible to perform molecular dynamics simulations with nearly the accuracy of the reference methods and cost that scales linearly with system size. Following the seminal work by Behler and Parinello ?,? and Bartok and Csanyi, ?−? ? numerous model architectures have been proposed. These include a broad range of approaches, from descriptor-based kernel models ?,? to equivariant message passing models ?,? and many others. ?−? ? While a significant focus by the community has been put into developing new architectures, all data-driven models are only as good as the data they were trained on. The quality of training data is becoming increasingly critical with the emerging interest in atomistic foundation models. ?,? Unlike potentials trained for a single system, these models aim to maximize transferability and robustness across domains.? However, current observations suggest inconsistent performance away from equilibrium and in the treatment of soft modes,? likely caused by a heavy reliance on relaxation trajectories as training data. Consequently, the development of methods capable of generating structurally diverse data sets is of utmost importance for the continued advancement of general-purpose atomistic models.
Historically, constructing data sets for MLIPs was a costly and labor-intensive process, as reference data was generated using ab initio molecular dynamics. ?,? Sampling a configuration space with this approach results in highly correlated samples, each requiring costly calculations. With the emergence of active learning approaches,? researchers have started to use preliminary MLIPs for the sampling of candidate configurations, for example, by molecular dynamics. Thus, it became possible to reduce the usage of costly quantum chemical reference methods only on the most informative candidates, resulting in more compact and more informative data sets.
A key advancement was the introduction of uncertainty estimation methods to the realm of MLIPs. With techniques such as model ensembling, ?−? ? Gaussian processes, ?,? or optimal experimental design,? it became possible to estimate the error of model predictions. Using these uncertainty estimates, MLIP-driven simulations can be terminated when uncertainty is high, and the most uncertain configurations can be selected for recalculation using ab initio methods. The model is then retrained on the updated data set. This iterative process, often called active learning or learning-on-the-fly, ?,? gradually extends data sets with informative structures, resulting in more compact data sets that cover large parts of the relevant configurational space.
Data set creation was further improved by the use of enhanced sampling methods, which increase structural diversity in the collected configurations. The initial work by Herr et al.? utilized the root-mean-square deviation (RMSD) between the current configuration and a series of previous ones as a collective variable. A metadynamics-like bias potential? was applied to promote the exploration of diverse regions within configurational space. More recent approaches have been tailored more specifically for MLIPs. Yoo et al.? have used the descriptor of high-dimensional neural network potentials as the collective variable instead of the RMSD.
Uncertainty-driven dynamics (UDD)? and hyperactive learning ?,?,? use the model’s uncertainty to bias the system toward regions where the model is less confident. Contour exploration ?,? evolves the system along constant potential energy contours. The position updates are only limited by the local curvature of the PES thereby allowing for significantly larger position updates than MD, without relying on uncertainty estimates like in UDD. Each of these provides a significant improvement in sampling efficiency over unbiased MD at a constant temperature. However, these methods also have certain limitations. The approach by Yoo et al.,? which performs metadynamics in descriptor space, constructs per-atom bias potentials, requiring a large number of descriptor comparisons and potentially leading to significant computational overhead. Uncertainty-based methods, while effective in driving exploration toward regions of high model error, do not explicitly account for the separation in time scales of different degrees of freedom. Intermolecular forces, such as those determining dynamical observables in liquids, are significantly smaller than intramolecular ones. Thus, if the target quantity is small and underestimated, a calibrated uncertainty estimate will also be small, and uncertainty-based methods will not significantly enhance sampling along slow degrees of freedom. This highlights a core conceptual difference between sampling objectives: increasing epistemic uncertainty versus increasing input diversity. Uncertainty-driven approaches are reactive; they rely on the model effectively identifying its own knowledge gaps, leaving them vulnerable to poor calibration or noise. Conversely, input diversity-driven approaches aim to maximize the volume of explored descriptor space independent of model error. By forcing the system to populate underrepresented regions of the descriptor manifold, they ensure robust generalization and comprehensive phase space coverage.
In this work, we introduce enhanced representation-based sampling (ERBS), a novel enhanced sampling method for efficiently generating training data for MLIPs. Starting from the mean descriptor of the system, we extract a small set of collective variables (CVs) via principal component analysis (PCA).? Using these CVs, we construct a bias potential based on the recently introduced OPES-Explore framework.? The combination of these CVs and bias potential allows for a rapid exploration of configurational space by following trajectories that sample preferentially along the k maximum variance components of the MLIP features.
This work is structured as follows. We begin with a description of the ERBS method. Its usefulness in creating a static, nonactive learned data set is explored for the alanine dipeptide system. Here, a screening of the bias parameters is performed, and the resulting configurational space coverage is analyzed. For some sets of these parameters, the generated trajectory is used as training data for MLIPs. We first cross-validate the prediction metrics of models using both biased and unbiased validation data sets. Models trained on low- and high-temperature MD and ERBS trajectories are then used to compute the free energy surface (FES) of the dihedral angles in alanine dipeptide. We find that the ERBS-trained models achieve lower errors with respect to the true FES compared to the MD-trained models, with the low-temperature MD model not producing a stable trajectory at all.
As a demonstration of ERBS’ use in an active learning setting, we turn to liquid water. Two active learning workflows are set up, one using unbiased MD and one using ERBS biasing for sampling candidate configurations. At each iteration of the workflow, the prediction metrics on the water data set by Cheng et al.? are calculated, and the diffusion coefficients are simulated. We find that the prediction error on the literature test set decreases and diffusion coefficients converge to the value obtained from a model trained on the data set by Cheng et al. significantly faster for the ERBS run.
Finally, we compare the sampling behavior of ERBS with that of UDD for the viscous room temperature ionic liquid 1-butyl-3-methylimidazolium tetrafluoroborate (BMIM^+^BF_4_ ^–^). Across a wide range of parameters, ERBS increases the mean squared displacement of up to 4 times compared to MD and 2 times compared to the best UDD result, indicating enhanced exploration of configurational space.
Methods
2
Gaussian Moment Neural
Network
2.1
Our enhanced-sampling approach descriptor agnostic. However, here we base it on the Gaussian-Moment Neural Network (GMNN) ?,? approach as it offers very fast training an inference times, while still achieving good prediction accuracy. Thus, GMNN is discussed here.
Given an atomic configuration S consisting of Cartesian coordinates R and atomic numbers Z, potentials used in molecular dynamics map from S to a potential energy E. Most MLIPs utilize an atomic energy decomposition to predict energies for each local atomic environment. ?,?
Restricting the range of interactions can be motivated by the short-sightedness of electronic matter,? and the resulting linear scaling with system size has significantly contributed to these models’ scalability. The GMNN model consists of a descriptor, which constructs an invariant representation of each atom, and neural networks for predicting atomic energies.
First, the pairwise distances between a central atom i and its neighbors j are expanded in a radial basis, using Gaussian? or Bessel functions.? Embedding parameters are used to form linear combinations of the n′ original basis functions dependent on the atomic numbers of the central and neighboring atoms, Z _ i _ and Z _ j _. This results in a contracted radial channel n. Finally, angular information is captured by Cartesian moments, i.e., polynomials of the unit distance vectors up to some rotation order L.
The invariant descriptor ** G ** is obtained from fully contracting the equivariant features Ψ_ i,L,n _ according to
Atomic energies are predicted from neural networks as E _ i _ = NN(** G ** _ i ) and adjusted by element-specific scaling and shifting parameters, σ Z _ i _ _ and μ_ Z _ i _ _.
Finally, the atomic energies E _ i _ are summed up as in eq. Forces are calculated as the gradient of the total energy using automatic differentiation. All learnable parameters of the model are optimized using stochastic gradient-based optimization. The loss function minimized during training contains terms for energy and force errors
Here, λ_E_ and λ_F_ denote hyperparameters for weighting the energy and force loss contributions, respectively.
Active Learning
2.2
In cases where no previously existing data set can be used for training an MLIP, a new one has to be created from scratch. However, ab initio molecular dynamics simulations are an expensive way to create training data as subsequent time steps are highly correlated. Consequently, a common approach is to start from a handful of samples, train an initial model, and use it to sample new candidate structures. A crucial step during sampling simulations is to terminate the trajectories when the model predictions become too inaccurate. Estimating the error in model predictions is known as uncertainty quantification and is an active area of research. ?,?−? ? ?
In the present work, we use shallow ensembles recently proposed by Kellner and Ceriotti.? A shallow ensemble shares the weights for all but the last linear layer. Instead of predicting a single atomic energy as in eq, the model instead predicts N ens values, and the uncertainty of a prediction can be estimated from the sample standard deviation of the ensemble.
The mean of the ensemble predictions is used to drive the dynamics.
A crucial aspect is the calibration of the predicted uncertainty, i.e., how well predicted uncertainty and true error correlate. By training a shallow ensemble on a probabilistic loss function, like the negative log likelihood (NLL), miscalibrated uncertainty estimates are directly penalized during training.
Once a sampling simulation has completed or was terminated after exceeding an uncertainty threshold, the trajectory represents a pool of candidate data and new data points can be selected from it. Given a pool , batch active learning methods select a subset of the structures that maximizes an acquisition function a,? which may depend on the model parameters
Throughout this work, we use greedy maximum-distance selection with a last-layer gradient feature map?
The feature map ϕ_ll_(S) serves to compare the similarity of two structures in terms of the model’s last layer weight gradient. Eq is applied iteratively to select structurally diverse points in feature space, a selection algorithm also known as farthest point sampling.? More details on the selection methods can be found in the original work by Zaverkin et al.?
Figurea depicts the typical steps involved in an active learning loop. As training iterations proceed, model uncertainty decreases, and the quality and stability of simulated trajectories increase (Figureb). Viewed differently, it takes increasingly longer for the model to encounter informative new configurations during a sampling simulation. As a result, the sampling trajectories need to become progressively longer.
(a) Illustration of an active learning cycle: Starting from seed data, models are iteratively trained and used to sample candidate configurations. From these candidates, the most informative ones are selected for DFT calculations. (b) Sampling stability improves and predicted uncertainties decrease over active learning iterations. (c) Workflow for constructing the ERBS potential: high-dimensional atomic descriptors s′ are aggregated and reduced to collective variables s via dimensionality reduction. The bias potential flattens the sampled distribution P.
In order to increase the diversity of sampled structures, enhanced sampling methods for the generation of MLIP training data have been developed. One such approach, UDD, builds directly on the model’s uncertainty estimate and uses it to steer the dynamics toward more uncertain regions. In UDD, a bias potential is constructed to encourage exploration of regions with high uncertainty?
Here, A and B are empirically determined parameters determining the strength of the bias and its gradient. The MD is then propagated by the sum of the MLIP prediction and E UDD. The methods described so far and the GMNN architecture are implemented in apax,? which is used for all model trainings and MD simulations in this work.
Enhanced Representation
Based Sampling
2.3
Improving configurational space sampling for MLIP data generation can be framed as exploring diverse inputs to the model. To an MLIP like GMNN, the descriptor represents the model’s input and provides a general-purpose set of collective variables for identifying undersampled regions in configuration space. For the present method, we use the average descriptor vector of the entire system.
Constructing a global descriptor ensures differentiability and computational efficiency and has found various applications in data set analysis and data selection. ?−? ? Here, we use a variant of the descriptor in eq without element-dependent parameters, β. However, including them or using other model features would also be possible. Over the course of a simulation, system-averaged descriptors are collected at fixed time intervals and used as reference descriptors to compare the similarity between the current configuration and past ones.
The descriptor is high-dimensional, and its entries are correlated with each other. As a result, sampling the descriptor space directly is challenging due to the curse of dimensionality. A reduced set of CVs is obtained from principal component analysis (PCA),? which identifies the most relevant collective motions in descriptor space. PCA is a common dimensionality reduction method that has also seen various applications in CV-based enhanced sampling. ?,? As usual for PCA, the data matrix is first centered using the per-feature mean μ.
Here, the data matrix S′ consists of stacked reference descriptors s′, and Ŝ represents its centered version. The principal components V are obtained from a singular value decomposition of Ŝ. By truncating V to the first k columns, we obtain a projection matrix, V ^(k)^, used to reduce the dimensionality of the descriptor. During sampling simulations, the CVs are computed via the feature dimensionality reduction function ϕ.
Although PCA is used throughout all experiments, we emphasize that other choices of ϕ, such as autoencoders,? are possible. Based on these CVs, a bias potential is constructed similarly to the “explore” variant of On-the-Fly probability enhanced sampling (OPES). ?,? While OPES estimates the unbiased probability distribution, OPES-explore estimates the well-tempered one. Since the well-tempered distribution is smoothed out, fewer kernels are required to estimate it.? The probability density of the CV space is modeled on-the-fly by depositing Gaussian kernels, K, at fixed intervals during a molecular dynamics simulation centered on the current averaged descriptor s _ j _.
We restrict ourselves to isotropic covariances Σ for computational efficiency. As we implement the kernel compression algorithm from Invernizzi and Parrinello,? these may become diagonal over the course of a simulation.
The overall well-tempered probability density p_n_ ^WT^(s) is, thus, modeled by the average over all kernels K(s, s _ j _).
The OPES-explore bias potential at time step n can then be calculated from the probability density as
Z _ n _ is a modified normalization constant computed from numerically integrating the density with the kernel centers as integration points. The parameter ϵ contains a barrier parameter ΔE via γ = βΔE, where β = 1/(k B T) is the inverse thermal energy.
The incorporation of ΔE allows the method to place a soft limit on the maximum strength of the bias potential. OPES-explore offers several advantages over metadynamics for configurational space exploration. While metadynamics and its variants slowly deposit Gaussian bias hills, OPES-explore models the probability density. As p_n_ ^WT^ is normalized, the simulation starts with a strong bias right from the start. Further, the normalization constant is modified in such a way that it only increases when kernels overlap, improving exploration. A more detailed discussion of these advantages and the construction of Z _ n _ can be found in the original publication by Invernizzi and Parrinello.? The conceptual steps involved in the ERBS method are illustrated in Figurec. It should be noted that a simulation can either use a fixed PCA basis constructed from a preexisting data set or use a variable one that is recomputed every time a new reference descriptor is added. For a fixed dimensionality reduction function the target-probability density in eq is well-defined. As the goal of ERBS is not to be a free energy method, but to sample the most structurally diverse configurations within a physically meaningful energy range, we use a variable basis for all experiments.
Finally, to analyze the scalability of ERBS, we consider the computational cost associated with the bias force evaluation. The gradient of the bias potential in eq can be written as
Here, is the Jacobian of the reduced descriptor vector s with respect to the atomic positions R. Notably, it is independent of the number of reference descriptors. The term , on the other hand, involves the sum of kernel gradient contributions from each of the reference descriptors and thus scales linearly with the number of references. However, since the first term consists of simple algebraic operations, its cost is negligible compared to that of evaluating .
As a result, the overall cost of ERBS using the GM descriptor is comparable to the cost of a GMNN force evaluation and remains practically independent of the number of reference descriptors used for even extensive active learning sampling runs. A comparison of the scaling behavior for ERBS and the method by Yoo et al.? can be found in the Supporting Information.
Results
3
Static Data Set Generation
for Alanine Dipeptide
3.1
Alanine dipeptide is a commonly used test system in the enhanced-sampling literature ?,? due to the thoroughly investigated FES in two of its dihedral angles, Φ and Ψ. Consequently, the degree to which a trajectory covers the dihedral angle space serves as an indication of how well a general-purpose sampling method is able to identify physically relevant degrees of freedom. In order to examine the sensitivity of ERBS to the particular choice of parameters, we perform an extensive parameter scan on alanine dipeptide in vacuum. The scanning ranges for the parameters in eqs, ? and ? are selected as follows. For the barrier parameter in eq, we use ΔE = {5, 10, 15, 20, 25} eV. The covariance matrix in eq is chosen as Σ = σ^2^ I _ k _, with σ = {0.05, 0.1, 0.2, 0.5}, corresponding to isotropic kernels in a reduced feature space. Finally, the dimensionality of the descriptor space in eq is selected from k = {2, 4, 6, 8, 10}.
For each possible parameter combination, an 80 ps trajectory is simulated using a Langevin thermostat at 300 K with a time step of 0.5 fs using the Amber99-SB force field ?,? in CP2K.? New kernels were deposited every 10,000 steps. In addition to the biased simulations, we also compute unbiased trajectories at 300 and 1200 K to set a baseline for the system’s exploration and to verify whether the effect of the bias could also be reached by increasing the temperature. The Φ–Ψ-space coverage for the unbiased and biased trajectories is displayed in Figure. In order to track and quantify the coverage, the space was tiled into 15°-by-15° squares. Coverage is then calculated by the ratio of visited squares to their total amount.
(a–c) Coverage vs parameter choices, (d) Φ–Ψ-space coverage of alanine dipeptide over time using molecular dynamics and the enhanced sampling method.
The initial structure is located in a free energy minimum. The unbiased dynamics are stuck there for the entire duration of the trajectory; hence, the coverage is severely limited. A few parameter choices with k = 2 lead to the dissociation of the molecule. This is expected for high barrier heights and small Gaussian bandwidths, which lead to such strong forces that the simulation becomes unstable. While the number of physically relevant collective variables in this system is two, ERBS learns them on the fly and, thus, needs time to identify them over the course of a simulation. It is thus advisible to initially overestimate the value of k, since there is only a negligible cost associated with an increase in k we largely find an insensitivity of the coverage to the particular choice of the number of principal components. The lack of a trend for k ≥ 2 demonstrates that the PCA successfully concentrated the relevant slow dynamics into the first few principal components. Otherwise, the exploration was drastically increased when compared to the unbiased simulation at 300 K, and most parameter choices outperformed even the 1200 K trajectories exploration with coverages of up to 75%. The parameter set leading to the largest coverage used ΔE = 20 eV, σ = 0.05 and k = 4 We will refer to the trajectory with the best coverage as ERBS B. We also highlight a second trajectory which also achieves good coverage but with a completely different set of parameters, ΔE = 25 eV, σ = 0.1 and k = 10 and refer to it as ERBS A.
Next, the suitability of the biased trajectories as MLIP training data was investigated. Four trajectories were chosen for data set creation: the unbiased MD trajectories at 300 and 1200 K and the biased trajectories ERBS A and ERBS B The data sets contain 1800 training and 200 validation samples, respectively. In each case, the data points are selected randomly, but the validation samples are taken from the last 20% of the trajectory. GMNN models were trained on each data set using identical hyperparameters. A radial basis consisting of 16 spherical Bessel functions was chosen with a cutoff of 5 Å. Two neural network layers of size 64 were used, and the model was trained with the AdamW optimizer. ?,? Further training details can be found in Table S1.
The force MAEs of each model evaluated on each data set are displayed in Figure.
Cross-validation of force MAEs. The x-axis indicates the training data set used to generate each model. The colored bars within each group represent the model’s performance on the four distinct validation data sets, as defined in the legend.
For each validation data set, the model trained on the corresponding training data set achieves the lowest error. While the model trained directly on the 300 K data achieves the lowest errors on the 300 K validation set, it performs the worst on all other validation sets. Conversely, the model trained on the 1200 K MD data generalizes better to other MD-derived data sets but ranks near the bottom on the ERBS validation sets. In contrast, models trained on ERBS-sampled data perform consistently well across biased and unbiased validation sets, demonstrating strong transferability. More detailed prediction-error parity plots for the four models can be found in Figure S5.
While validation metrics give a reasonable first estimate of model performance, the purpose of MLIPs is to compute physical observables, such as FESs. The FES in Φ and Ψ can be evaluated using models trained on configurations from biased and unbiased dynamics. Well-tempered metadynamics simulations using Plumed ?,? were conducted for the MD-300 K, MD–1200 K, and ERBS models trained on the data sets from the previous experiment. A biasing factor of 10, hill size of 1.2 kJ mol^–1^, and bandwidth of 0.35 rad were used. It should be emphasized that the parameters for metadynamics fulfill different purposes and cannot be directly compared to the ones used in ERBS despite similar names. Further, a ground truth simulation was carried out with the Amber99-SB force field. All simulations were conducted in the NVT ensemble at 300 K and lasted for 10 ns. The reference free-energy surface and the errors of the FES produced by the models MD-1200 K and the ERBS models are displayed in Figure. All FES were shifted such that the global minimum of each surface is set to zero. This alignment allows for consistent comparison of relative free energy differences, and the error was computed with respect to the reference FES based on the aligned values.
(a) Free energy surface (FES) of alanine dipeptide computed with the reference force field. (b–d) Signed errors (ΔFES = FESMLIP–FESReference) between the reference and the FES computed with models trained on data from (b) MD-1200 K, (c) ERBS A, and (d) ERBS B. Blue regions indicate lower predicted free energies relative to the reference, while red regions indicate overestimation. The red cross and green stars indicate the location of the Global free energy minimum of the reference force field and trained MLIPs, respectively.
The model trained on the 300 K trajectory data is unstable. The system forms new bonds during the trajectory, which keeps it stuck in a nonphysical free-energy minimum. Despite the different parameter choices for the bias and resulting differences in the Φ–Ψ space coverage of models ERBS A and ERBS B, they achieve similar quality in their FESs. While the MD-1200 K model also achieves good metrics, it is surpassed by both models trained on enhanced sampling data. In terms of localization, both the ERBS B and MD 1200 K models correctly identify the global free energy minimum, whereas ERBS A exhibits a minor deviation. However, the MD 1200 K model yields significantly larger errors in the broader free energy surface. This discrepancy can be attributed to the training data: the 1200 K trajectory remained trapped within the global minimum basin, failing to explore the right-hand side of Ramachandran space. The coverage of the three sampling trajectories is shown in Figure S1. The best model, ERBS A, achieves a free energy MAE of 1.02 kcal mol^–1^, which is almost chemically accurate, but could easily be refined for production accuracy within a few active learning iterations.
In a recent study by Tan et al.,? the data efficiency of MLIPs in reconstructing the alanine dipeptide FES was investigated. Using their novel eABF method, they perform active learning for the same system and conclude that an accurate FES may be accessible at 4000 data points. The data sets generated using the ERBS method were taken from static 80 ps trajectories and are comprised of 2000 data points, suggesting that chemical accuracy could be achieved with significantly less than 4000 data points when using ERBS biasing in an active learning setting.
Accelerated
Active Learning of Liquid Water
3.2
In order to demonstrate the efficacy of ERBS in an active learning setting, we consider the case of liquid water. The initial system was constructed using Packmol? with 32 water molecules in a periodic box at the experimental density of 997 kg m^–3^. Energy and forces of all data points collected in the active learning scheme were computed using the revPBE0 hybrid functional? with a plane wave cutoff of 400 Ry, TZV2P-GTH basis sets,? Goedecker–Teter–Hutter pseudo potentials ?−? ? and the D3 dispersion correction ?,? in CP2K.? All DFT parameters are adapted from Cheng et al.?
Starting from the Packmol structure, 200 configurations were bootstrapped by randomly rotating and translating molecules and atomic displacements. The first 2 iterations of active learning were performed on these bootstrapped samples as follows. A first model was trained on 10 and validated on 6 randomly selected data points. From the remaining data pool, another 5 training samples were chosen using the maximum distance selection with last-layer gradient features.
Afterward, two AL workflows were set up. Both are identical apart from the use or lack of the ERBS bias potential. For the biased workflow, a new kernel was placed every 5 ps with a barrier height of 0.5 eV per atom and a bandwidth of 1.0. The first 3 PCA components were used in the construction of the CVs. GMNN models were trained as shallow ensembles with 16 members in order to have access to uncertainty estimates. The radial cutoff was chosen to be 5.5 Å and a neural network with two hidden layers of 128 and 64 units was used throughout. The remaining hyperparameters are listed in Table S1.
For every iteration, a 25 ps sampling simulation was set up using a time step of 0.5 fs and a coupling constant of 500 fs for the Berendsen thermostat. A force uncertainty threshold of 3.0 meV Å^–1^ was used as a stopping criterion for the trajectories. The sampling simulations were conducted with the ASESafeSampling node in IPSuite. Upon reaching the uncertainty threshold, the geometry is reset to the starting configuration, the momenta are initialized, and the simulation continues until the specified duration is reached. After each sampling simulation, 4 data points were chosen randomly for the validation data set, and 10 data points were chosen using maximum distance selection with last-layer gradient features from the trajectory. For the remainder of the active learning cycles, these two methods were used for all training and validation data selections, respectively. In total, 10 active learning iterations were conducted for both setups, resulting in 115 training and 46 validation samples in each case. The prediction-error parity plots for the active learned models as well as the model trained on the literature data set can be found in Figure S6.
The water data set by Cheng et al.? was generated with an emphasis on structural diversity and at in part sampled using path-integral molecular dynamics. Models trained on it were shown to achieve good agreement with experiment for structural properties of liquid water, relative stabilities of different phases of water ice and other thermodynamic observables. As we use the same level of theory and DFT code for labeling data in this experiment, we can evaluate all models on their data set. Figurea displays the Force MAE for each model iteration from the active learning runs with and without enhanced sampling compared to a model fitted directly to the literature training data. We observe that the models trained using ERBS-enhanced sampling consistently achieve lower force errors over all AL iterations compared to those trained using standard MD AL sampling. As the number of AL cycles increases, the performance gap between the two approaches narrows. It is to be expected that neither workflow achieves chemical accuracy on the literature test data set. The test data set includes configurations of varying densities and bond breakages, neither of which is contained in the data sets constructed here.
(a) Force mean absolute errors with respect to the literature water data set across active learning iterations for the models created with MD and ERBS-based active learning. (b) Diffusion coefficients of water simulated using models from successive active learning iterations, compared to the value obtained with the model trained on the literature data set.
In addition to test set metrics, we simulate diffusion coefficients to investigate the ability of the two AL setups in producing models that can accurately simulate experimental observables. We simulate 5 ns trajectories of 256 water molecules in the NVT ensemble at 300 K using the models obtained from each AL iteration. Additionally, the diffusion was also simulated with the model trained on the literature data set in order to rule out model-architecture-specific limitations in describing the dynamics. The center-of-mass diffusion coefficients, D are obtained from a linear fit to the mean-squared displacement (MSD)
Here, the sum goes over all N equivalent particle positions, R, or in this case the mass centers of the water molecules and Δt is the simulation time. Based on the diffusion coefficient thus calculated from a finite system, the infinite system-size limit is obtained via the Yeh–Hummer correction.?
The first term, D(L) corresponds to the diffusion coefficient obtained from rearranging 21 for a finite system of side length L. The correction term includes the geometry-dependent constant ξ = 2.837297 (for a cubic box) and the shear viscosity η.
The first 200 ps of the trajectories are discarded for equilibration. To estimate uncertainties, block averaging is applied. All simulated diffusion coefficients and the experimental value are displayed in Figureb.
Diffusion coefficients simulated with the ERBS models are consistently closer the reference value obtained with the model trained on the literature data set across all AL iterations. Starting from iteration 4, the ERBS model diffusion coefficients reach good agreement with the value produced with the reference model, with the standard errors overlapping for 5 of the remaining 7 trajectories. To further validate the active-learned models, we have computed the oxygen–oxygen radial distribution functions, which can be found in Figure S2
While the reference and ERBS models yield similar diffusion coefficients, they both overestimate the experimental value of 241 Å^2^ ns^–1^ ? by about 25 Å^2^ ns^–1^. Possible reasons could either be limitations of the GMNN model or a tendency of the PBE0 hybrid functional to overestimate the diffusion. Diffusion coefficients reported in AIMD studies strongly depend on the choice of density functional. ?−? ? In contrast, Daru et al.? report excellent agreement between simulated and experimental diffusion coefficients using an MLIP trained on coupled cluster data, with simulations that also account for nuclear quantum effects. It is worth noting that, due to the small data sets used here, adding a few data points can change the PES of the model quite drastically between iterations, and a monotonic convergence to the reference model cannot be expected. Nevertheless, the rapid improvement and overall stability of the ERBS-trained models highlight the effectiveness of enhanced sampling in building accurate MLIPs with minimal data.
Comparison to Uncertainty
Based Sampling
3.3
To benchmark the sampling performance of ERBS against UDD, we selected the ionic liquid BMIM^+^BF_4_ ^–^ as a test system. Its inherently high viscosity characteristic poses a particular challenge for active learning, as long molecular dynamics trajectories are typically required to adequately explore intermolecular interactions. ?−? ?
A shallow ensemble is trained on a 200-structure subset of the BMIM^+^BF_4_ ^–^ data set created by Zills et al.? The prediction-error parity plots for the model are shown in Figure S7. The model hyperparameters are identical to the experiment on liquid water from the previous section. It achieves energy and force MAEs on the validation set of 0.75 meV atom^–1^ and 62 meV Å^–1^ respectively. The quality of uncertainty estimates is discussed in Section S5. Next, we perform sampling simulations with unbiased MD, UDD, and ERBS. To ensure a fair comparison, we performed parameter scans for both UDD and ERBS. We perform UDD for all combinations of A = {0.1, 0.5, 1.0, 2.0, 5.0, 10.0} eV atom^–1^ and B = {0.1, 0.5, 1.0, 2.0} eV cf. Eq. As ERBS has more parameters, we focus on the barrier height, bandwidth, and number of principal components. Starting from ΔE = 1.5 eV atom^–1^, σ = 2.0, and k = 2, one parameter at a time was scanned while holding the others fixed to reduce the number of simulations. The individual parameter ranges were chosen as ΔE = {0.1, 0.5, 1.5, 2.5} eV atom^–1^, σ = {0.5, 1.0, 2.0, 5.0} and k = {1, 2, 3, 4}. The sampling simulations lasted for 100 ps and were simulated with a Berendsen thermostat using a coupling constant of 500 fs and a time step of 0.5 fs.
To assess sampling efficiency, we use the mean squared displacement (MSD) of the BF_4_ center of mass, as defined in eq, as a proxy measure. A higher MSD reflects a larger deviation from the initial configuration, indicating more extensive sampling of the intermolecular degrees of freedom. The maximum MSD for each trajectory is displayed in Figure.
Final mean square displacements BF4 – center of masses from BMIM+BF4 –trajectories simulated with MD and various parameters for ERBS and UDD. For the three ERBS parameter scans, the size of the dots is proportional to the value of the respective parameter. For ERBS, these ranges are ΔE = {0.1, 0.5, 1.5, 2.5} eV atom–1, σ = {0.5, 1.0, 2.0, 5.0} and k = {1, 2, 3, 4} and for UDD A = {0.1, 0.5, 1.0, 2.0, 5.0, 10.0} and B = {0.1, 0.5, 1.0, 2.0}.
It can be seen that UDD is capable of enhancing the sampling of intermolecular degrees of freedom to some degree, but only a few parameter choices offer a significant enhancement. The trajectory with an MSD of 0.5 Å^2^ ns^–1^ terminated early due to bond breakage, leading to small overall displacements. All runs with B = 0.1 eV terminated within the first 100 simulation steps, before a configuration could be collected. In the case of ERBS, only the ΔE = 2.5 eV atom^–1^ run is terminated early for the same reason. We find that for most parameter choices, ERBS drastically enhances the intermolecular motions, while the enhancement for UDD is significantly less pronounced. In the best cases, ERBS achieves an MSD almost 5 times larger than unbiased MD, while UDD only shows an increase of a factor of 1.8.
To visualize the differences in exploration strategies from, we consider the distribution of averaged Gaussian moment descriptors. We compute the average descriptors for the MD trajectory and the highest MSD UDD, and ERBS trajectories and project them into a common 2D PCA basis. Figure reveals that the UDD trajectory samples configurations that are shifted with respect to the MD trajectory in the dimensionality reduced descriptor space. For ERBS, there is a pronounced clustering, arising from the deposition of new bias potential into the descriptor space. In contrast, the ERBS trajectory displays a distinct clustering pattern that evolves over time, a direct consequence of the iterative deposition of bias potential in descriptor space. Overall, the ERBS trajectory exhibits the highest variance in descriptor space, highlighting the method’s ability to drive the system out of local minima and explore a significantly larger volume of the physically relevant configurational space.
First two principal components of the system average descriptors of BMIM+BF4 – for an MD, a UDD and an ERBS trajectory. The ERBS descriptors are color-coded by the simulation time to highlight the clustering.
To explain the sampling behavior of UDD, we consider the bias forces
where the derivative of the predicted energy variance σ_ E _ ^2^ with respect to atomic positions R is proportional to
As such, the bias forces depend on the disagreement between the ensemble members’ force predictions.
In molecular liquids, forces predictions can be analytically decomposed into vibrational, rotational, and translational components, ?,? with the intermolecular forces typically being an order of magnitude smaller than the intramolecular ones. Hence, the intermolecular bias forces depend on the disagreement of the ensemble members’ intermolecular force predictions. This decomposition is illustrated in the parity plots shown in Section S6. Both the predicted intermolecular forces and their uncertainties remain small for a well-trained and well-calibrated MLIP. Since the UDD forces along particular degrees of freedom are directly linked to the model’s uncertainty along these degrees of freedom, the bias tends to vanish in the directions of slow, collective motion. This limits the effectiveness of UDD in enhancing sampling along shallow degrees of freedom.
Conclusion
4
In this work, we have introduced ERBS, a general-purpose enhanced sampling strategy that allows for the construction of diverse data sets for MLIPs. Starting from the model descriptor averaged over the whole system, ERBS identifies the slowest collective modes in descriptor space via principal component analysis. A bias potential is then constructed based on the recently introduced OPES-Explore method, which combines desirable attributes such as rapid exploration of the FES and limiting the maximum strength of the bias potential. Unlike uncertainty-based enhanced sampling methods, ERBS does not require a preliminary MLIP and can be used on top of any interatomic potential. Model independence makes ERBS particularly convenient in early stage data set generation or for systems where classical force fields or pretrained MLIPs are readily available.
We first evaluated ERBS using a classical force field for alanine dipeptide. ERBS was able to achieve up to 75% coverage of the dihedral angle space in just 80 ps of simulation, and the resulting data enabled the training of MLIPs that accurately reproduce the free energy landscape.
To demonstrate utility in an active learning setting, we applied ERBS to the iterative construction of a water data set. Compared to unbiased sampling, ERBS dramatically accelerated convergence toward accurate diffusion coefficients, matching the quality of models trained on literature data sets using an order of magnitude fewer training points. The model trained exclusively with MD samples, on the other hand, showed significant deviations from the reference diffusion coefficient throughout.
Finally, we compared the sampling behavior of ERBS and uncertainty-driven dynamics for the highly viscous ionic liquid BMIM^+^BF_4_ ^–^. We observed that ERBS more effectively enhances sampling along slow, intermolecular degrees of freedom. The accelerated sampling is attributed to the use of global, low-dimensional collective variables, which avoids overemphasizing high-frequency intramolecular modes, which are often the largest contributors to mode uncertainty and error. ERBS offers a fast method for enhancing the quality of training data both in static and active learning settings, even when no pretrained model is available. The role of a pretrained representation and extensions to constant-pressure simulations will be explored in future work.
While this study focuses on the exploration of molecular and liquid configuration spaces, the variance-based identification of collective variables is a phase-agnostic principle, suggesting that the method is applicable to solid-state sampling as well. We intend to further investigate this line of research in future work.
Lastly, the use of enhanced sampling techniques, such as ERBS, may also prove valuable in the context of constructing data sets for atomistic foundation models. By systematically exploring underrepresented regions of configuration space, these methods can help ensure broad coverage of structural motifs. This could lead to more compact and diverse data sets, ultimately improving the transferability and robustness of foundation models across domains.
Supplementary Material
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Unke O. T.Chmiela S.Sauceda H. E.Gastegger M.Poltavsky I.Schütt K. T.Tkatchenko A.Müller K.-R.Machine Learning Force Fields Chem. Rev.2021121101421018610.1021/acs.chemrev.0c 0111133705118 PMC 8391964 · doi ↗ · pubmed ↗
- 2Behler J.Csányi G.Machine Learning Potentials for Extended Systems: A Perspective European Physical Journal B 202194142
- 3Behler J.Parrinello M.Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces Phys. Rev. Lett.20079814640110.1103/Phys Rev Lett.98.14640117501293 · doi ↗ · pubmed ↗
- 4Behler J.Atom-centered symmetry functions for constructing high-dimensional neural network potentials J. Chem. Phys.201113407410610.1063/1.355371721341827 · doi ↗ · pubmed ↗
- 5Bartók, A. P. Gaussian Approximation Potential: An Interatomic Potential Derived from First Principles Quantum Mechanics; Ph D thesis, 2009.
- 6Bartók A. P.Payne M. C.Kondor R.Csányi G.Gaussian Approximation Potentials: The Accuracy of Quantum Mechanics, without the Electrons Phys. Rev. Lett.201010413640310.1103/Phys Rev Lett.104.13640320481899 · doi ↗ · pubmed ↗
- 7Bartók A. P.Kondor R.Csányi G.On representing chemical environments Phys. Rev. B 20138718411510.1103/Phys Rev B.87.184115 · doi ↗
- 8Chmiela S.Tkatchenko A.Sauceda H. E.Poltavsky I.Schütt K. T.Müller K. R.Machine Learning of Accurate Energy-Conserving Molecular Force Fields Sci. Adv.20173 e 160301510.1126/sciadv.160301528508076 PMC 5419702 · doi ↗ · pubmed ↗
