FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs
Saeed Mohammadzadeh, Erfan Hamdi, Joel Shor, Emma Lejeune

TL;DR
FEM-Bench is a new benchmark for evaluating whether AI models can generate correct finite element method (FEM) code. Its tasks are physical modeling problems inspired by computational mechanics, and the results reveal the limitations of current models.
Contribution
Introduces FEM-Bench, a structured benchmark with physics-based tasks for assessing LLMs' scientific code generation in computational mechanics.
Findings
State-of-the-art LLMs struggle to reliably solve all tasks.
Gemini 3 Pro completed 30/33 tasks at least once in five attempts.
GPT-5 achieved an average joint success rate of 73.8%.
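The "at least once in five attempts" figure can be read as a per-task count: a task is counted as completed if any of its five attempts passes. The sketch below illustrates that arithmetic only. The `results` dictionary, the task names, and the function name are hypothetical placeholders, not data or code from the paper, and the "joint success rate" is not defined in this summary, so it is not reproduced here.

```python
from typing import Dict, List


def solved_at_least_once(results: Dict[str, List[bool]]) -> int:
    """Count tasks for which at least one attempt passed."""
    return sum(1 for attempts in results.values() if any(attempts))


# Hypothetical outcomes: 33 tasks, 5 attempts each (True = attempt passed).
# These values are made up so that 30 tasks have at least one passing attempt.
results: Dict[str, List[bool]] = {}
for i in range(33):
    if i < 30:
        results[f"task_{i:02d}"] = [True, False, False, False, False]
    else:
        results[f"task_{i:02d}"] = [False] * 5

solved = solved_at_least_once(results)
total = len(results)
print(f"{solved}/{total} tasks completed at least once ({solved / total:.1%})")
```

Under this reading, 30/33 corresponds to roughly 90.9% of tasks, which is a more lenient measure than requiring every attempt to pass.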
Abstract
As LLMs advance their reasoning capabilities about the physical world, the absence of rigorous benchmarks for evaluating their ability to generate scientifically valid physical models has become a critical gap. Computational mechanics, which develops and applies mathematical models and numerical methods to predict the behavior of physical systems under forces, deformation, and constraints, provides an ideal foundation for structured scientific reasoning evaluation. Problems follow clear mathematical structure, enforce strict physical and numerical constraints, and support objective verification. The discipline requires constructing explicit models of physical systems and reasoning about geometry, spatial relationships, and material behavior, connecting directly to emerging AI goals in physical reasoning and world modeling. We introduce FEM-Bench, a computational mechanics benchmark…
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Model Reduction and Neural Networks
