SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code
Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi

TL;DR
SymPyBench is a benchmark of parameterized physics problems, each backed by executable Python code, designed to evaluate and improve scientific reasoning in AI models. It introduces new metrics for variability and uncertainty across problem variants.
Contribution
The paper presents a large-scale benchmark of 15,045 parameterized physics problems together with three new evaluation metrics (Consistency Score, Failure Rate, and Confusion Rate), enabling dynamic testing of reasoning skills in AI models.
Findings
State-of-the-art models show strengths in some reasoning tasks.
The benchmark reveals limitations in models' consistency and uncertainty handling.
New metrics provide deeper insights into model reasoning variability.
Abstract
We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90%/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy (Consistency Score, Failure Rate, and Confusion Rate) that quantify variability and uncertainty across problem variants. Experiments with…
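To make the idea of a parameterized problem with executable ground truth concrete, here is a minimal sketch using only the Python standard library. It is not taken from the benchmark: the projectile problem, the `Variant` class, and the helpers `ground_truth`, `sample_variants`, `is_correct`, and `fraction_correct` are all hypothetical names chosen for illustration, and the 1% relative tolerance is an assumption. The per-problem fraction of correct variants shown at the end is only a simple stand-in for measuring variability; the abstract does not define Consistency Score, Failure Rate, or Confusion Rate, so they are not reproduced here.

```python
import math
import random
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant for this sketch)


@dataclass(frozen=True)
class Variant:
    """One concrete instantiation of a hypothetical projectile-range problem."""
    v0: float         # launch speed in m/s
    angle_deg: float  # launch angle above the horizontal, in degrees

    def question(self) -> str:
        return (
            f"A projectile is launched at {self.v0} m/s, {self.angle_deg} degrees "
            "above level ground. Ignoring air resistance, how far does it travel "
            "horizontally before landing?"
        )


def ground_truth(variant: Variant) -> float:
    """Executable solution: range on level ground, R = v0^2 * sin(2*theta) / g."""
    theta = math.radians(variant.angle_deg)
    return variant.v0 ** 2 * math.sin(2 * theta) / G


def sample_variants(n: int, seed: int = 0) -> list[Variant]:
    """Draw n parameter sets reproducibly from assumed parameter ranges."""
    rng = random.Random(seed)
    return [
        Variant(v0=rng.randint(5, 50), angle_deg=rng.randint(10, 80))
        for _ in range(n)
    ]


def is_correct(answer: float, variant: Variant, rel_tol: float = 1e-2) -> bool:
    """Compare a numeric answer with the ground truth within a relative tolerance."""
    return math.isclose(answer, ground_truth(variant), rel_tol=rel_tol)


def fraction_correct(answers: list[float], variants: list[Variant]) -> float:
    """Share of variants of one problem that were answered correctly."""
    hits = sum(is_correct(a, v) for a, v in zip(answers, variants))
    return hits / len(variants)
```

Because the solution is code rather than a stored number, the same problem can be re-asked with fresh parameters on every evaluation run. Comparing a model's answers across `sample_variants(n)` then shows whether it solves the underlying problem reliably or only for some inputs.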
Taxonomy
Topics: Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI) · Topic Modeling
