An information-matching approach to optimal experimental design and active learning

Yonatan Kurniawan (1), Tracianne B. Neilsen (1), Benjamin L. Francis (2), Alex M. Stankovic (3), Mingjian Wen (4), Ilia Nikiforov (5), Ellad B. Tadmor (5), Vasily V. Bulatov (6), Vincenzo Lordi (6), Mark K. Transtrum (1, 2, and 3) ((1) Brigham Young University, Provo, UT, USA; (2) Achilles Heel Technologies, Orem, UT, USA; (3) SLAC National Accelerator Laboratory, Menlo Park, CA, USA; (4) University of Electronic Science and Technology of China, Chengdu, China; (5) University of Minnesota, Minneapolis, MN, USA; (6) Lawrence Livermore National Laboratory)

arXiv:2411.02740 · cs.LG · May 8, 2026

TL;DR

This paper presents an information-matching criterion, based on the Fisher Information Matrix, that selects the most informative training data from a candidate pool, so that models can predict quantities of interest accurately from small datasets in several scientific fields and in active learning.

Contribution

It formulates data selection as a convex optimization problem, which makes it scalable to large models and datasets, and it selects only the data needed to learn the parameters that constrain the quantities of interest.

Findings

01. Small, optimally selected datasets suffice for accurate predictions.

02. The method is effective across diverse scientific modeling problems.

03. Active learning with this criterion enhances data efficiency.

Abstract

The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various…
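
The matching idea in the abstract can be illustrated with a small numerical sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes a least-squares model with unit Gaussian noise, so each Fisher Information Matrix is J^T J for a Jacobian J. It also assumes that "sufficient information" means the weighted sum of candidate FIMs minus the QoI FIM is positive semidefinite. The function names `fim` and `information_gap`, the toy Jacobians, and the weights are all hypothetical. The convex optimization step that would choose the weights is omitted; the sketch only checks whether given weights satisfy the assumed matching condition.

```python
import numpy as np


def fim(jacobian):
    """Fisher information J^T J (assumes least squares with unit Gaussian noise)."""
    jacobian = np.atleast_2d(np.asarray(jacobian, dtype=float))
    return jacobian.T @ jacobian


def information_gap(weights, candidate_fims, target_fim):
    """Smallest eigenvalue of sum_m w_m * I_m - J_qoi.

    A nonnegative value means the weighted candidate data carry at least as
    much information as the QoI requires (the assumed matching condition).
    """
    total = sum(w * I for w, I in zip(weights, candidate_fims))
    return float(np.linalg.eigvalsh(total - target_fim).min())


# Toy model with two parameters (theta_1, theta_2); all numbers are made up.
candidate_fims = [
    fim([1.0, 0.0]),  # candidate A: sensitive to theta_1 only
    fim([0.0, 1.0]),  # candidate B: sensitive to theta_2 only
    fim([1.0, 1.0]),  # candidate C: sensitive to both
]

# The QoI depends on theta_1 only, so theta_2 need not be learned.
qoi_fim = fim([2.0, 0.0])

gap_a_only = information_gap([4.0, 0.0, 0.0], candidate_fims, qoi_fim)
gap_b_only = information_gap([0.0, 10.0, 0.0], candidate_fims, qoi_fim)

print("A only matches the QoI information:", gap_a_only >= -1e-9)
print("B only matches the QoI information:", gap_b_only >= -1e-9)
```

In this toy case, candidate A alone satisfies the condition with enough weight, while no amount of weight on candidate B does, because B carries no information about the one parameter the QoI depends on. This mirrors the abstract's point that only the parameters needed to constrain downstream QoIs have to be learned.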
