SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs
Yadi Cao, Sicheng Lai, Jiahe Huang, Yang Zhang, Zach Lawrence, Rohan Bhakta, Izzy F. Thomas, Mingyun Cao, Chung-Hao Tsai, Zihao Zhou, Yidong Zhao, Hao Liu, Alessandro Marinoni, Alexey Arefiev, Rose Yu

TL;DR
SimulCost is a benchmark and toolkit designed to evaluate and improve the cost-efficiency of LLMs in physics simulations, considering both accuracy and resource expenditure across multiple simulators.
Contribution
It introduces the first cost-sensitive physics simulation benchmark and toolkit, enabling analysis of LLM tuning strategies under realistic resource constraints.
Findings
Frontier LLMs achieve 46-64% success in single-round mode.
Multi-round mode improves success rates to 71-80%.
LLMs are 1.5-2.5x slower than traditional scanning, which undermines their cost-efficiency.
Abstract
Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning of cost-sensitive parameters against a traditional scanning approach on both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (adjustment by trial-and-error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46--64% success rates in single-round mode, dropping to 35--54% under high accuracy requirements, rendering their…
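To make the cost-aware comparison in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not SimulCost's code: the toy "simulator" (a trapezoid-rule integral), the one-unit-per-cell cost model, and the names `simulate`, `scan`, and `tune_with_proposals` are all hypothetical. It only shows the general idea of charging every simulator call an analytically defined cost and comparing a resolution scan with a list of proposed parameter values (one proposal for single-round, several for multi-round).

```python
import math


def simulate(n_cells: int) -> tuple[float, int]:
    """Toy stand-in for a simulator: trapezoid-rule integral of sin on [0, pi].

    Returns (absolute error against the exact value 2, analytic cost).
    The cost model (one unit per grid cell) is an assumption for illustration.
    """
    h = math.pi / n_cells
    total = 0.5 * (math.sin(0.0) + math.sin(math.pi))
    total += sum(math.sin(i * h) for i in range(1, n_cells))
    error = abs(total * h - 2.0)
    return error, n_cells


def scan(tolerance: float, start: int = 2, max_cells: int = 1 << 20) -> tuple[int, int]:
    """Baseline scan: double the resolution until the error meets the tolerance.

    Returns (accepted resolution, total cost of all runs performed).
    """
    n = start
    total_cost = 0
    while n <= max_cells:
        error, cost = simulate(n)
        total_cost += cost
        if error <= tolerance:
            return n, total_cost
        n *= 2
    raise RuntimeError("tolerance not reached within max_cells")


def tune_with_proposals(proposals: list[int], tolerance: float) -> tuple[bool, int]:
    """Evaluate proposed resolutions in order (e.g. values suggested by an LLM).

    A single-element list models single-round mode; a longer list models
    multi-round trial and error. Returns (success, total cost spent).
    """
    total_cost = 0
    for n in proposals:
        error, cost = simulate(n)
        total_cost += cost
        if error <= tolerance:
            return True, total_cost
    return False, total_cost


if __name__ == "__main__":
    tol = 1e-3
    print("scan:", scan(tol))
    print("single-round guess:", tune_with_proposals([8], tol))
    print("multi-round guesses:", tune_with_proposals([16, 64], tol))
```

In this toy setting, every failed attempt still adds to the total cost, which is why success rate alone does not capture how economical a tuning strategy is.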
