SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

Shima Imani; Seungwhan Moon; Adel Ahmadyan; Lu Zhang; Kirmani Ahmed; Babak Damavandi

arXiv:2512.05954·cs.AI·December 8, 2025

TL;DR

SymPyBench is a synthetic benchmark of 15,045 parameterized, university-level physics problems, each paired with executable Python code that produces the ground-truth answer. It is designed to evaluate scientific reasoning in AI models and introduces new metrics for variability and uncertainty across problem variants.

Contribution

The paper presents a large-scale, parameterized physics benchmark together with novel evaluation metrics, so that a model's reasoning can be tested dynamically across many variants of the same problem.

Findings

01

State-of-the-art models show strengths in some reasoning tasks.

02

The benchmark reveals limitations in models' consistency and uncertainty handling.

03

New metrics provide deeper insights into model reasoning variability.

Abstract

We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90%/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy (Consistency Score, Failure Rate, and Confusion Rate) that quantify variability and uncertainty across problem variants. Experiments with…
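To make the idea of a parameterized problem with executable ground truth concrete, here is a minimal sketch in plain Python. It is not taken from the benchmark: the `ProjectileProblem` class, the `sample_variant` and `is_correct` helpers, the parameter ranges, and the 1% tolerance are all hypothetical choices made for illustration. It does not implement the paper's Consistency Score, Failure Rate, or Confusion Rate, whose definitions are not given in this excerpt.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectileProblem:
    """Hypothetical parameterized problem: projectile range on level ground."""
    v0: float          # launch speed in m/s
    angle_deg: float   # launch angle above the horizontal, in degrees
    g: float = 9.81    # gravitational acceleration in m/s^2

    def text(self) -> str:
        # Render the problem statement for this particular parameter set.
        return (
            f"A projectile is launched at {self.v0} m/s at {self.angle_deg} "
            f"degrees above the horizontal. Taking g = {self.g} m/s^2 and "
            "neglecting air resistance, find its horizontal range on level ground."
        )

    def solve(self) -> float:
        # Ground truth from the standard range formula R = v0^2 * sin(2*theta) / g.
        theta = math.radians(self.angle_deg)
        return self.v0 ** 2 * math.sin(2 * theta) / self.g


def sample_variant(rng: random.Random) -> ProjectileProblem:
    # Draw a fresh parameter set; the ranges here are illustrative assumptions.
    return ProjectileProblem(
        v0=round(rng.uniform(5.0, 50.0), 1),
        angle_deg=float(rng.randint(10, 80)),
    )


def is_correct(problem: ProjectileProblem, answer: float, rel_tol: float = 1e-2) -> bool:
    # Score a free-form numerical answer against the executable ground truth.
    return math.isclose(answer, problem.solve(), rel_tol=rel_tol)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        variant = sample_variant(rng)
        print(variant.text())
        print(f"  ground truth: {variant.solve():.2f} m")
```

Because the answer is computed rather than stored, every sampled variant comes with its own ground truth, which is the property that lets the same underlying problem be posed to a model many times with different numbers.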

Taxonomy

Topics: Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI) · Topic Modeling