Foundation-Model Surrogates Enable Data-Efficient Active Learning for Materials Discovery

Jeffrey Hu; Rongzhi Dong; Ying Feng; Ming Hu; Jianjun Hu

arXiv:2603.12567 · cond-mat.mtrl-sci · March 25, 2026

Open Access

TL;DR

This paper introduces In-Context Active Learning (ICAL), an active learning approach that uses the TabPFN foundation model as its surrogate, improving data efficiency and uncertainty calibration on materials discovery tasks.

Contribution

The paper presents ICAL, which replaces the conventional GP and RF surrogates with a pre-trained transformer model. The surrogate performs Bayesian inference without being retrained, and ICAL outperforms both baselines on most of the materials datasets tested.

Findings

01

ICAL outperforms GP and RF on 8 of 10 datasets.

02

ICAL achieves a 52% reduction in extra evaluations compared to GP.

03

ICAL exhibits superior uncertainty calibration, with the lowest negative log-likelihood (NLL).
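For readers unfamiliar with NLL as a calibration measure, the sketch below shows one common way to compute it. It assumes each prediction is a Gaussian with a mean and a standard deviation; the listing does not say how the paper computes NLL, so the function name `gaussian_nll` and the Gaussian form are illustrative assumptions, not the authors' implementation.

```python
import math


def gaussian_nll(y_true, mean, std):
    """Average negative log-likelihood of targets under per-point
    Gaussian predictions (assumed form; the paper's metric may differ)."""
    total = 0.0
    for y, m, s in zip(y_true, mean, std):
        var = s * s
        # -log N(y | m, var) = 0.5*log(2*pi*var) + (y - m)^2 / (2*var)
        total += 0.5 * math.log(2.0 * math.pi * var) + (y - m) ** 2 / (2.0 * var)
    return total / len(y_true)


# Same prediction error, two different stated uncertainties.
overconfident = gaussian_nll([1.0], [0.0], [0.1])
honest = gaussian_nll([1.0], [0.0], [1.0])
print(overconfident > honest)
```

Lower is better: a model that is wrong while claiming a small standard deviation is penalized more heavily than one whose stated uncertainty matches its error, which is why NLL is used as a calibration score.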

Abstract

Active learning (AL) has emerged as a powerful paradigm for accelerating materials discovery by iteratively steering experiments toward promising candidates, reducing the number of costly synthesis-and-characterization cycles needed to identify optimal materials. However, current AL relies predominantly on Gaussian Process (GP) and Random Forest (RF) surrogates, which suffer from complementary limitations: GP underfits complex composition-property landscapes due to rigid kernel assumptions, while RF produces unreliable heuristic uncertainty estimates in small-data regimes. This small-data challenge is pervasive in materials science, making reliable surrogate modeling extremely difficult with models trained from scratch on each new dataset. Here we propose In-Context Active Learning (ICAL), which addresses this bottleneck by replacing conventional surrogates with TabPFN, a…
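The iterative loop the abstract describes can be sketched as follows. This is a minimal illustration under several assumptions that are not from the paper: a 1-D candidate pool, maximization of the property, an upper-confidence-bound acquisition rule, and a deliberately crude stand-in surrogate (`toy_surrogate`, a hypothetical helper). The stand-in is not a GP, an RF, or TabPFN; it only marks the slot where a surrogate returning a prediction and an uncertainty would be plugged in.

```python
import random


def toy_surrogate(X_train, y_train, x):
    """Hypothetical stand-in surrogate: inverse-distance-weighted mean,
    with uncertainty equal to the distance to the nearest labeled point."""
    dists = [abs(x - xt) for xt in X_train]
    weights = [1.0 / (d + 1e-9) for d in dists]
    mean = sum(w * y for w, y in zip(weights, y_train)) / sum(weights)
    return mean, min(dists)


def active_learning(objective, pool, n_init=3, n_iter=10, kappa=1.0, seed=0):
    """Label n_init random candidates, then repeatedly evaluate the
    unlabeled candidate with the highest mean + kappa * uncertainty."""
    rng = random.Random(seed)
    X = rng.sample(pool, n_init)
    y = [objective(x) for x in X]
    remaining = [x for x in pool if x not in X]

    for _ in range(n_iter):
        if not remaining:
            break

        def ucb(x):
            mean, unc = toy_surrogate(X, y, x)
            return mean + kappa * unc

        x_next = max(remaining, key=ucb)   # most promising candidate
        remaining.remove(x_next)
        X.append(x_next)
        y.append(objective(x_next))        # the costly "experiment"
    return X, y


# Toy property with its maximum at x = 0.7 (illustrative only).
pool = [i / 50 for i in range(51)]
X, y = active_learning(lambda x: -(x - 0.7) ** 2, pool)
print(len(X), max(y))
```

Each pass through the loop costs one evaluation of `objective`, so the quality of the surrogate's predictions and uncertainties directly determines how many evaluations are needed, which is the quantity the abstract aims to reduce.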

Taxonomy

Topics: Machine Learning in Materials Science · Gaussian Processes and Bayesian Inference · Domain Adaptation and Few-Shot Learning