Beyond State Consistency: Behavior Consistency in Text-Based World Models
Youling Huang, Guanqiao Chen, Junchi Yao, Lu Wang, Fangkai Yang, Chao Du, ChenZhuo Zhao, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

TL;DR
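This paper introduces a behavior-aligned training paradigm for text-based world models built around Behavior Consistency Reward (BehR), a step-level metric that checks whether an agent would act the same way on the predicted state as on the real one. Training with BehR improves long-term alignment with the real environment and reduces false positives in offline evaluation.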
Contribution
It proposes a behavior-aligned training method that uses BehR to improve the functional consistency between a text-based world model and the real environment, beyond what single-step similarity metrics such as Exact Match capture.
Findings
BehR-based training improves long-term alignment in WebShop and TextWorld.
Models trained with BehR produce fewer false positives in offline evaluation.
Inference-time lookahead planning shows modest gains.
Abstract
World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text-based environments, world models are typically evaluated and trained with single-step metrics such as Exact Match, aiming to improve the similarity between predicted and real-world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior-aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step-level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world-model-predicted state under a frozen Reference…
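As a rough illustration of the step-level quantity the abstract describes, the sketch below compares how likely a logged next action is under a frozen reference policy when that policy sees the real state versus the world-model-predicted state. The abstract is truncated before it gives the exact functional form, so the absolute log-likelihood difference used here is an assumption, and `behavior_consistency_gap`, `toy_policy`, and the action strings are hypothetical names chosen for illustration, not the authors' implementation.

```python
import math
from typing import Callable, Dict

# Hypothetical interface: a frozen reference policy maps a textual state
# to a probability distribution over candidate actions.
Policy = Callable[[str], Dict[str, float]]


def behavior_consistency_gap(
    policy: Policy,
    real_state: str,
    predicted_state: str,
    logged_action: str,
    eps: float = 1e-12,
) -> float:
    """Absolute change in log-likelihood of the logged action between the
    real state and the predicted state (assumed form, see lead-in).
    A gap of 0 means the policy treats both states identically for this action."""
    p_real = policy(real_state).get(logged_action, 0.0)
    p_pred = policy(predicted_state).get(logged_action, 0.0)
    # eps guards against log(0) when the action has no probability mass.
    return abs(math.log(p_real + eps) - math.log(p_pred + eps))


def toy_policy(state: str) -> Dict[str, float]:
    """Stand-in for a frozen reference agent; a real one would be a language model."""
    if "buy now" in state.lower():
        return {"click[buy now]": 0.8, "click[back]": 0.2}
    return {"click[buy now]": 0.1, "click[back]": 0.9}


real = "Product page. Buttons: [Buy Now] [Back]"
good_prediction = "Product page. Buttons: [Back] [Buy Now]"  # differs textually
bad_prediction = "Product page. Buttons: [Back]"  # drops the key button

gap_good = behavior_consistency_gap(toy_policy, real, good_prediction, "click[buy now]")
gap_bad = behavior_consistency_gap(toy_policy, real, bad_prediction, "click[buy now]")
print(gap_good, gap_bad)
```

In this toy setup the first prediction would fail an Exact Match check yet leaves the policy's behavior unchanged, while the second shifts the likelihood of the logged action substantially. A training reward could plausibly be derived from the negated gap, though that mapping is likewise an assumption.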
