LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting

TL;DR
This paper introduces LEAPBench, a framework that evaluates LLMs in iterative scientific design by scoring the whole learning trajectory, which exposes differences in learning efficiency that outcome-only metrics miss.
Contribution
The paper proposes a trajectory-based evaluation metric and benchmark for LLMs in scientific design, challenges existing outcome-focused assessments, and shows that the choice of metric affects both model selection and training.
Findings
Switching to trajectory scoring changes the best-model decision on 53% of tasks.
LLMs do not outperform classical Bayesian optimization baselines.
Trajectory-based training with RL improves performance on 14 out of 21 tasks.
Abstract
LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench, Learning Efficiency in Adaptive Processes, a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric…
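The idea of a best-so-far AUC can be illustrated with a minimal sketch. The excerpt above does not give the paper's exact definition, so everything here is an assumption made for illustration: scores are treated as higher-is-better, iterations are evenly spaced, and the area is normalized as a simple mean of the running-maximum curve. The function names `best_so_far` and `best_so_far_auc` and the two example trajectories are hypothetical.

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum of the observed scores (assumes higher is better)."""
    return list(accumulate(scores, max))


def best_so_far_auc(scores):
    """Mean height of the best-so-far curve.

    Assumption: unit spacing between iterations and a plain mean as the
    normalized area. The benchmark's own definition may differ.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    curve = best_so_far(scores)
    return sum(curve) / len(curve)


# Two hypothetical trajectories that reach the same final best score.
fast = [0.2, 0.8, 0.9, 0.9, 0.9]  # finds a good design early
slow = [0.2, 0.3, 0.4, 0.5, 0.9]  # finds it only at the last iteration

if __name__ == "__main__":
    print("final best:", best_so_far(fast)[-1], best_so_far(slow)[-1])
    print("AUC fast:", best_so_far_auc(fast))
    print("AUC slow:", best_so_far_auc(slow))
```

Under these assumptions, an outcome snapshot at the final iteration cannot separate the two runs, while the trajectory score favors the run that improves earlier. That is the kind of difference a fixed-horizon metric leaves unmeasured.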
