FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

Saeed Mohammadzadeh; Erfan Hamdi; Joel Shor; Emma Lejeune

arXiv:2512.20732·cs.LG·December 25, 2025

TL;DR

FEM-Bench is a benchmark that evaluates how well LLMs generate correct finite element method code for physical modeling tasks drawn from computational mechanics. Results show that current models still have clear limitations.

Contribution

Introduces FEM-Bench, a structured benchmark with physics-based tasks for assessing LLMs' scientific code generation in computational mechanics.

Findings

01. State-of-the-art LLMs struggle to reliably solve all tasks.

02. Gemini 3 Pro completed 30/33 tasks at least once in five attempts.

03. GPT-5 achieved an average joint success rate of 73.8%.
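
The "at least once in five attempts" count reported above can be read as a per-task tally over repeated attempts. The following is a minimal sketch of that tally, not the benchmark's own scoring code: the function names, the dictionary layout, and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def tasks_solved_at_least_once(results: Dict[str, List[bool]]) -> int:
    """Count tasks where any attempt passed (hypothetical helper)."""
    return sum(1 for attempts in results.values() if any(attempts))


def tasks_solved_every_time(results: Dict[str, List[bool]]) -> int:
    """Count tasks where every recorded attempt passed (hypothetical helper)."""
    return sum(1 for attempts in results.values() if attempts and all(attempts))


# Made-up toy data: four tasks, five attempts each. These are NOT FEM-Bench results.
toy_results = {
    "task_a": [True, True, True, True, True],
    "task_b": [False, True, False, False, False],
    "task_c": [False, False, False, False, False],
    "task_d": [True, False, True, True, False],
}

solved_once = tasks_solved_at_least_once(toy_results)
solved_always = tasks_solved_every_time(toy_results)
print(f"{solved_once}/{len(toy_results)} tasks solved at least once")
print(f"{solved_always}/{len(toy_results)} tasks solved on every attempt")
```

Reporting both counts side by side separates a model that can occasionally solve a task from one that solves it reliably, which is the distinction the first finding draws.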

Abstract

As LLMs advance their reasoning capabilities about the physical world, the absence of rigorous benchmarks for evaluating their ability to generate scientifically valid physical models has become a critical gap. Computational mechanics, which develops and applies mathematical models and numerical methods to predict the behavior of physical systems under forces, deformation, and constraints, provides an ideal foundation for structured scientific reasoning evaluation. Problems follow clear mathematical structure, enforce strict physical and numerical constraints, and support objective verification. The discipline requires constructing explicit models of physical systems and reasoning about geometry, spatial relationships, and material behavior, connecting directly to emerging AI goals in physical reasoning and world modeling. We introduce FEM-Bench, a computational mechanics benchmark…

Taxonomy

Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Model Reduction and Neural Networks