Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields
Tobias Kreiman, Aditi S. Krishnapriyan

TL;DR
This paper investigates how machine learning force fields (MLFFs) generalize across diverse chemical spaces, identifies challenges due to distribution shifts, and proposes test-time refinement methods to improve out-of-distribution performance.
Contribution
The paper introduces two novel test-time refinement strategies to mitigate distribution shifts in MLFFs, enhancing their ability to generalize beyond training data without expensive retraining.
Findings
Test-time refinement reduces errors on out-of-distribution systems.
Distribution shifts significantly degrade MLFF performance, even for large foundation models trained on extensive data.
Proposed methods are computationally inexpensive and effective.
Abstract
Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces that are of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. In order to characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges, even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and learning poor representations of out-of-distribution systems. We then propose two new methods as initial steps for mitigating distribution shifts for MLFFs. Our methods focus on test-time…
Peer Reviews
Decision · Submitted to ICLR 2025
* Given that expanding MLFF applicability to diverse chemical spaces is a primary goal within AI-for-Science, this approach is well-motivated and offers a promising step towards this objective.
* The paper establishes practical criteria for identifying distribution shifts, which may inspire future work, and the proposed refinement strategies demonstrate meaningful performance improvements over large-scale foundation models and pre-trained baselines on established benchmarks.
* The paper is well-
* The proposed strategies yield only modest performance gains over pre-trained models, falling short of significantly improving OOD sample prediction accuracy to levels comparable with in-distribution (ID) samples or chemical accuracy. This limitation may hinder the practical applicability of the methods and affect the perceived technical contribution.
* The prior model used in the test-time training strategy (sGDML) may also face generalization issues on OOD samples, potentially limiting the ef
1. The paper studies existing foundation MLFFs and investigates their generalization ability, which could be helpful for the community.
2. The paper proposes an interesting approach to search for the most similar graph structure by tuning the radius for the test unseen molecules.
1. Even though the paper shows some improvements on unseen molecules, the performance is still magnitudes away from the desirable chemical accuracy. In other words, the proposed method could hardly be useful in practice. As the downstream task could be much more expensive (e.g., wet lab experiments or subject studies) than the simulations, accuracy is still the top priority for MLFF.
2. The actual distance for radius refinement is quite simple, and may not capture more detailed information in t
- The ML force field is a direction of significant interest to the AI for Science community and ICLR audience.
- The OOD challenge is an important one for ML FFs, as we expect to use ML FFs in extrapolation tasks such as relaxation and MD simulations.
- The proposed method improves model OOD performance on a variety of tasks.
- The analysis of different flavors of OOD-ness and how the proposed method helps is valuable to the community.
The authors include an MD benchmark on the effectiveness of TTT, which is great. However, it would be more interesting to test the SPICE-trained models in MD simulation, since generalization to unseen molecules is a better-suited task for a SPICE-trained model. Further, it would be interesting to see how the OOD-ness of a test molecule impacts MD performance. The authors show a reduction in Force MAE with TTT and its correlation with the OOD-ness metrics such as force norm and graph Laplaci
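One review above describes the radius refinement as searching for the most similar graph structure by tuning the radius for an unseen test molecule. A minimal sketch of that idea follows, under assumptions not stated on this page: similarity is measured as the L2 distance between graph Laplacian eigenvalue spectra, a reference spectrum of the same length is available from training data, and the cutoff is chosen from a small candidate list. The names `laplacian_spectrum` and `refine_cutoff`, the toy geometry, and the candidate cutoffs are hypothetical and are not the authors' implementation.

```python
import numpy as np


def laplacian_spectrum(positions, cutoff):
    """Sorted eigenvalues of the unnormalized Laplacian L = D - A of a radius graph."""
    positions = np.asarray(positions, dtype=float)
    # Pairwise distances between all atoms.
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    # Connect atoms closer than the cutoff; no self-loops.
    adjacency = (dists < cutoff).astype(float)
    np.fill_diagonal(adjacency, 0.0)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return np.linalg.eigvalsh(laplacian)


def refine_cutoff(positions, reference_spectrum, candidate_cutoffs):
    """Pick the candidate cutoff whose spectrum is closest (L2) to the reference."""
    reference_spectrum = np.asarray(reference_spectrum, dtype=float)
    errors = [
        np.linalg.norm(laplacian_spectrum(positions, c) - reference_spectrum)
        for c in candidate_cutoffs
    ]
    return candidate_cutoffs[int(np.argmin(errors))]


# Toy example: four atoms on a line, 1.0 apart (hypothetical units).
chain = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
# Hypothetical reference: the spectrum of a fully connected 4-atom graph.
reference = laplacian_spectrum(chain, cutoff=3.5)
best = refine_cutoff(chain, reference, [1.5, 2.5, 3.5])
print(best)
```

The sketch compares spectra of equal length only; handling molecules of different sizes would need an extra design choice (for example, comparing a fixed number of leading eigenvalues), which is left open here.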
Taxonomy
Topics: Simulation Techniques and Applications · Reservoir Engineering and Simulation Methods
Methods: Focus · ALIGN
