MDGYM: Benchmarking AI Agents on Molecular Simulations
Vinay Kumar, Satyendra Rajput, Mausam, N. M. Anoop Krishnan

TL;DR
This paper introduces MDGYM, a benchmark with 169 molecular dynamics simulations to evaluate AI agents' ability to perform complex scientific workflows, revealing significant performance gaps and unique failure modes.
Contribution
The paper presents MDGYM, a new benchmark for testing AI agents on molecular simulations, and evaluates three agentic frameworks (Claude Code, Codex, and OpenHands) with four LLMs, highlighting their limitations in physical reasoning tasks.
Findings
All evaluated agents perform poorly; even the strongest solves only 21% of easy-level tasks.
Agents often produce physically unstable configurations or fabricate outputs.
Failure modes differ from general software benchmarks, indicating challenges in grounded physical reasoning.
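As a rough illustration of what a "physically unstable" run can look like in practice, the sketch below flags a total-energy time series that contains non-finite values or drifts too far from its starting value. This is not the paper's evaluation procedure: the function name `is_energy_stable`, the `max_rel_drift` threshold, and the assumption of an energy-conserving (NVE-style) run are all hypothetical choices made for this example.

```python
import math
from typing import Sequence


def is_energy_stable(total_energy: Sequence[float],
                     max_rel_drift: float = 1e-3) -> bool:
    """Heuristic stability check (illustrative only, not from the paper).

    Returns True when every energy value is finite and the largest
    deviation from the first value, relative to that first value,
    stays within max_rel_drift.
    """
    if len(total_energy) < 2:
        return False  # too few samples to judge stability

    # NaN or inf energies are a typical sign of a blown-up simulation.
    if any(not math.isfinite(e) for e in total_energy):
        return False

    e0 = total_energy[0]
    scale = max(abs(e0), 1e-12)  # guard against division by zero
    worst_drift = max(abs(e - e0) for e in total_energy) / scale
    return worst_drift <= max_rel_drift


if __name__ == "__main__":
    print(is_energy_stable([-100.0, -100.01, -99.99]))   # small fluctuation
    print(is_energy_stable([-100.0, -50.0, float("nan")]))  # blown-up run
```

A check like this only catches gross numerical failure. It would not detect fabricated outputs or a stable run that reports the wrong physical quantity, which still require comparison against reference results.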
Abstract
The promise of AI-driven scientific discovery hinges on whether AI agents can autonomously design and execute the computational workflows that underpin modern science. Molecular dynamics (MD) simulation presents a natural test bed to stress-test this claim; it requires translating physical intuition into syntactically and semantically correct input scripts, reasoning about initial and boundary conditions, diagnosing numerically unstable trajectories, and interpreting outputs against known physical behavior and laws. We introduce MDGYM, a benchmark of 169 expert-curated MD simulations spanning LAMMPS and GROMACS, two widely used MD packages, across three increasing difficulty levels. We evaluate three agentic frameworks -- Claude Code, Codex, and OpenHands -- with four LLMs, and find that all perform poorly: even the strongest agent solves only 21% of easy-level tasks, with less than…
