Text2World: Benchmarking Large Language Models for Symbolic World Model Generation
Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Yao Mu, Hongyuan Zhang, Wenqi Shao, Ping Luo

TL;DR
This paper introduces Text2World, a PDDL-based benchmark for evaluating how well large language models generate symbolic world models from text. It targets three weaknesses of prior evaluations: evaluation randomness, dependence on indirect metrics, and limited domain scope.
Contribution
The paper presents Text2World, a benchmark with hundreds of diverse domains and multi-criteria, execution-based metrics. It benchmarks current LLMs, exposes their limitations, and explores strategies for improving world modeling.
Findings
Reasoning models trained with large-scale reinforcement learning outperform other models.
Even the best models show limited world modeling capabilities.
Strategies like test-time scaling and agent training can enhance performance.
Abstract
Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on the Planning Domain Definition Language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several…
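For readers unfamiliar with PDDL, the sketch below shows what a small symbolic world model looks like and how a generated domain could be given a crude structural check. It is only an illustration: the toy domain and the helper names parse_sexpr and summarize_domain are hypothetical, and the check is a simple well-formedness test, not the benchmark's execution-based metrics.

```python
# Hypothetical toy domain for illustration; not taken from the Text2World benchmark.
TOY_DOMAIN = """
(define (domain toy-switch)
  (:requirements :strips)
  (:predicates (on ?s) (off ?s))
  (:action turn-on
    :parameters (?s)
    :precondition (off ?s)
    :effect (and (on ?s) (not (off ?s)))))
"""


def parse_sexpr(text):
    """Parse one parenthesized s-expression into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume the matching ")"
            return items
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the definition")
    return tree


def summarize_domain(text):
    """Return the domain name, predicate names, and action names."""
    tree = parse_sexpr(text)
    if not isinstance(tree, list) or not tree or tree[0] != "define":
        raise ValueError("not a PDDL definition")
    summary = {"name": None, "predicates": [], "actions": []}
    for section in tree[1:]:
        if not isinstance(section, list) or not section:
            continue
        head = section[0]
        if head == "domain":
            summary["name"] = section[1]
        elif head == ":predicates":
            summary["predicates"] = [p[0] for p in section[1:]]
        elif head == ":action":
            summary["actions"].append(section[1])
    return summary


if __name__ == "__main__":
    print(summarize_domain(TOY_DOMAIN))
```

A model output that drops a closing parenthesis makes parse_sexpr raise ValueError, which is the kind of syntactic failure any execution-based evaluation would have to catch before deeper checks.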
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
