Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

Mengkang Hu; Tianxing Chen; Yude Zou; Yuheng Lei; Qiguang Chen; Ming Li; Yao Mu; Hongyuan Zhang; Wenqi Shao; Ping Luo

arXiv:2502.13092·cs.CL·February 25, 2025

TL;DR

This paper introduces Text2World, a comprehensive benchmark for evaluating large language models' ability to generate symbolic world models from text, addressing previous evaluation challenges and providing insights into current capabilities and future improvements.

Contribution

The paper presents a new benchmark, Text2World, with diverse domains and robust metrics, and evaluates LLMs, revealing their limitations and exploring strategies to improve world modeling.

Findings

01 Reasoning models trained with large-scale reinforcement learning outperform other models.

02 Even the best-performing model shows limited world-modeling capability.

03 Strategies such as test-time scaling and agent training can improve performance.

Abstract

Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several…
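
For readers unfamiliar with the term, a symbolic world model of the kind PDDL describes can be pictured as a set of actions, each with preconditions and effects over logical facts. The Python sketch below is only an illustration of that idea and of what "execution-based" checking might look like in miniature: it applies a plan step by step and tests whether the goal holds. It is not the benchmark's code or its metrics; the `Action` class, the helper names, and the two-block example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A grounded STRIPS-style action (hypothetical illustration)."""
    name: str
    preconditions: frozenset  # facts that must hold before the action
    add_effects: frozenset    # facts made true by the action
    del_effects: frozenset    # facts made false by the action


def applicable(state: frozenset, action: Action) -> bool:
    # An action is applicable when every precondition is in the state.
    return action.preconditions <= state


def apply(state: frozenset, action: Action) -> frozenset:
    # Remove deleted facts, then add new ones.
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.del_effects) | action.add_effects


def plan_reaches_goal(init: frozenset, plan: list, goal: frozenset) -> bool:
    # Execute the plan; fail if any step is inapplicable or the goal is unmet.
    state = init
    for action in plan:
        if not applicable(state, action):
            return False
        state = apply(state, action)
    return goal <= state


# Toy two-block domain (assumed for illustration only).
pick_up_a = Action(
    "pick-up-a",
    preconditions=frozenset({"on-table a", "clear a", "hand-empty"}),
    add_effects=frozenset({"holding a"}),
    del_effects=frozenset({"on-table a", "clear a", "hand-empty"}),
)
stack_a_b = Action(
    "stack-a-b",
    preconditions=frozenset({"holding a", "clear b"}),
    add_effects=frozenset({"on a b", "clear a", "hand-empty"}),
    del_effects=frozenset({"holding a", "clear b"}),
)

init = frozenset({"on-table a", "on-table b", "clear a", "clear b", "hand-empty"})
goal = frozenset({"on a b"})

valid = plan_reaches_goal(init, [pick_up_a, stack_a_b], goal)
invalid = plan_reaches_goal(init, [stack_a_b], goal)
```

Here `valid` holds because both steps are applicable in sequence and the goal fact is produced, while `invalid` fails because `stack-a-b` requires `holding a`, which the initial state lacks. A generated world model with a wrong precondition or effect would change these outcomes, which is the intuition behind checking models by executing them rather than by comparing text.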

Code & Models

Datasets

xdzouyd/text2world · dataset · 6 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques