EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

Yi Liu; TingFeng Hui; Wei Zhang; Li Sun; Ningxin Su; Jian Wang; Sen Su

arXiv:2605.07247·cs.AI·May 11, 2026

TL;DR

EnvSimBench is a comprehensive benchmark designed to evaluate and enhance the ability of LLMs to accurately simulate environments, addressing hallucinations and inconsistencies that hinder scalable AI agent training.

Contribution

It introduces the first formal EnvSim Ability metric, a diverse benchmark dataset, and a constraint-driven pipeline to improve LLM-based environment simulation fidelity and efficiency.

Findings

01. All state-of-the-art LLMs struggle with multi-state updates, achieving high accuracy only on invariant states.

02. The proposed pipeline reduces hallucinations and increases environment synthesis yield by 6.8%.

03. The approach cuts simulation costs by over 90%.
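To make the first finding concrete, the sketch below shows one way to score a simulated state transition separately on fields that should change and fields that should stay invariant. It is an illustration only, not the paper's EnvSim Ability metric: the function name `split_state_accuracy`, the flat dictionary state representation, and the exact-match scoring rule are all assumptions made for this example.

```python
from typing import Any, Dict, List, Tuple

State = Dict[str, Any]

# Sentinel so that a key absent from `before` is never confused with a None value.
_MISSING = object()


def split_state_accuracy(
    before: State, truth_after: State, predicted_after: State
) -> Tuple[float, float]:
    """Return (accuracy on changed fields, accuracy on invariant fields).

    Hypothetical scoring rule: a field counts as correct only if the
    simulator's predicted value exactly equals the ground-truth value.
    """
    changed: List[str] = []
    invariant: List[str] = []
    for key, value in truth_after.items():
        if before.get(key, _MISSING) == value:
            invariant.append(key)
        else:
            changed.append(key)

    def accuracy(keys: List[str]) -> float:
        if not keys:
            return 1.0  # nothing to get wrong in an empty group
        hits = sum(
            1 for k in keys if predicted_after.get(k, _MISSING) == truth_after[k]
        )
        return hits / len(keys)

    return accuracy(changed), accuracy(invariant)
```

Under these assumptions, a simulator that copies the previous state forward would score perfectly on invariant fields while missing every update, which is the kind of gap a split score is meant to expose.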

Abstract

Scalable AI agent training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: that LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state-drift failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable…

Code & Models

Repositories

cookieApril/EnvSimBench (GitHub)
