MDGYM: Benchmarking AI Agents on Molecular Simulations

Vinay Kumar; Satyendra Rajput; Mausam; N. M. Anoop Krishnan

arXiv:2605.08941 · cs.AI · May 12, 2026

TL;DR

This paper introduces MDGYM, a benchmark of 169 expert-curated molecular dynamics simulations for evaluating whether AI agents can carry out complex scientific workflows. It reveals large performance gaps and failure modes that differ from those seen in general software benchmarks.

Contribution

The paper presents MDGYM, a new benchmark for testing AI agents on molecular dynamics simulations in LAMMPS and GROMACS, and evaluates three agentic frameworks (Claude Code, Codex, and OpenHands) with four LLMs, highlighting their limitations in physical reasoning tasks.

Findings

01

All evaluated agents perform poorly: even the strongest solves only 21% of easy-level tasks.

02

Agents often produce physically unstable configurations or fabricate outputs.

03

Failure modes differ from general software benchmarks, indicating challenges in grounded physical reasoning.

Abstract

The promise of AI-driven scientific discovery hinges on whether AI agents can autonomously design and execute the computational workflows that underpin modern science. Molecular dynamics (MD) simulation presents a natural test bed to stress-test this claim; it requires translating physical intuition into syntactically and semantically correct input scripts, reasoning about initial and boundary conditions, diagnosing numerically unstable trajectories, and interpreting outputs against known physical behavior and laws. We introduce MDGYM, a benchmark of 169 expert-curated MD simulations spanning LAMMPS and GROMACS, two widely used MD packages, across three increasing difficulty levels. We evaluate three agentic frameworks (Claude Code, Codex, and OpenHands) with four LLMs, and find that all perform poorly: even the strongest agent solves only 21% of easy-level tasks, with less than…
