SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

Yadi Cao; Sicheng Lai; Jiahe Huang; Yang Zhang; Zach Lawrence; Rohan Bhakta; Izzy F. Thomas; Mingyun Cao; Chung-Hao Tsai; Zihao Zhou; Yidong Zhao; Hao Liu; Alessandro Marinoni; Alexey Arefiev; Rose Yu

arXiv:2603.20253 · physics.comp-ph · March 31, 2026

TL;DR

SimulCost is a benchmark and toolkit designed to evaluate and improve the cost-efficiency of LLMs in physics simulations, considering both accuracy and resource expenditure across multiple simulators.

Contribution

It introduces the first cost-sensitive physics simulation benchmark and toolkit, enabling analysis of LLM tuning strategies under realistic resource constraints.

Findings

01 Frontier LLMs achieve 46-64% success in single-round mode.

02 Multi-round mode improves success rates to 71-80%.

03 LLMs are 1.5-2.5x slower than traditional scanning, affecting cost-efficiency.

Abstract

Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs such as simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations. SimulCost compares LLM-based tuning of cost-sensitive parameters against a traditional scanning approach in both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (adjustment by trial and error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46-64% success rates in single-round mode, dropping to 35-54% under high accuracy requirements, rendering their…
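To make the comparison in the abstract concrete, here is a minimal sketch of how a scanning baseline and a one-shot guess might be scored on both accuracy and cost. Everything in it is an illustrative assumption, not the benchmark's actual code: `toy_simulate` is a hypothetical stand-in simulator (a midpoint-rule integral with a known exact answer), its cost is simply defined as the number of steps, and `scan` and `single_guess` are invented names.

```python
import math

EXACT = 2.0  # exact value of the integral of sin(x) over [0, pi]


def toy_simulate(n_steps):
    """Hypothetical stand-in simulator (assumption, not from SimulCost).

    Approximates the integral of sin(x) on [0, pi] with the midpoint rule.
    Returns (result, cost); cost is defined analytically as n_steps, so it
    does not depend on the hardware the code runs on.
    """
    h = math.pi / n_steps
    result = h * sum(math.sin((i + 0.5) * h) for i in range(n_steps))
    return result, n_steps


def scan(tolerance, start=2, max_steps=1 << 20):
    """Scanning baseline: double the resolution until the error is acceptable.

    Every trial run is paid for, so the total cost is the sum over all runs.
    """
    total_cost = 0
    n = start
    while n <= max_steps:
        result, cost = toy_simulate(n)
        total_cost += cost
        if abs(result - EXACT) <= tolerance:
            return n, total_cost
        n *= 2
    raise RuntimeError("no resolution within max_steps met the tolerance")


def single_guess(n_guess, tolerance):
    """Single-round mode: one guess, one run, judged on accuracy and cost."""
    result, cost = toy_simulate(n_guess)
    return abs(result - EXACT) <= tolerance, cost
```

Under these assumptions, `scan(1e-3)` stops at 32 steps after paying for five runs (total cost 62), while a correct one-shot guess of 32 steps costs only 32. An over-cautious guess such as 1024 steps would also succeed on accuracy but cost far more than the scan, which is the kind of trade-off a cost-aware metric is meant to expose.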

Code & Models

Repositories

Rose-STL-Lab/SimulCost-Bench (GitHub)