LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed

Chang Yang; Xinrun Wang; Junzhe Jiang; Qinggang Zhang; Xiao Huang

arXiv:2411.08794 · cs.AI · March 20, 2026

TL;DR

This paper evaluates Large Language Model-based world models in decision making across diverse environments, highlighting their strengths and limitations in various tasks without relying on traditional planning modules.

Contribution

It introduces a comprehensive decision-making evaluation framework for LLM-based world models, focusing on their performance in policy verification, action proposal, and planning tasks.
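To make the first of these tasks concrete, here is a minimal sketch of what policy verification with a world model could look like. It is an illustration under stated assumptions, not the paper's implementation: `toy_world_model` is a hypothetical deterministic stand-in for an LLM call, and the `WorldModel` type, the `verify_policy` helper, and the counter task are all invented for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a world model maps (state, action) to
# (predicted next state, predicted task-completion flag).
WorldModel = Callable[[int, str], Tuple[int, bool]]


def toy_world_model(state: int, action: str) -> Tuple[int, bool]:
    """Stand-in for an LLM query (assumption): a counter task that ends at 3."""
    next_state = state + 1 if action == "inc" else state
    return next_state, next_state >= 3


def verify_policy(model: WorldModel, start: int, actions: List[str]) -> bool:
    """Roll an action sequence through the model's predicted transitions.

    Returns True if the model predicts that the sequence completes the task.
    """
    state = start
    for action in actions:
        state, done = model(state, action)
        if done:
            return True
    return False
```

In this sketch, `verify_policy(toy_world_model, 0, ["inc", "inc", "inc"])` is judged to complete the task, while a sequence containing a wasted step such as `["inc", "noop", "inc"]` is not. Swapping the stub for an LLM-backed predictor would leave the verification loop unchanged.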

Findings

1. GPT-4o outperforms GPT-4o-mini, especially with domain knowledge.

2. Performance declines in long-term decision tasks.

3. Combining functionalities causes performance instability.

Abstract

The world model has emerged as a key module in decision making, with MuZero and Dreamer achieving remarkable success in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators that model the dynamics of the world, owing to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, existing world models are evaluated either as general world simulators or as a functional module of the agent, i.e., predicting transitions to assist planning. In this work, we propose a comprehensive evaluation of LLM-based world models from the decision-making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate a rule-based policy for each environment to support a diverse evaluation. Then, we design three main tasks, i.e.,…

Taxonomy

Topics: Advanced Data Processing Techniques · Fuzzy Logic and Control Systems

Methods: Residual Connection · Monte-Carlo Tree Search · Batch Normalization · Convolution · Residual Block · Prioritized Experience Replay · Average Pooling · MuZero