Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Max Zuo; Francisco Piedrahita Velez; Xiaochen Li; Michael L. Littman; Stephen H. Bach

arXiv:2407.03321 · cs.CL · November 12, 2025

TL;DR

This paper introduces Planetarium, a benchmark that pairs a PDDL equivalence algorithm with a dataset of 145,918 text-to-PDDL pairs to assess how accurately language models translate natural-language descriptions of planning tasks into structured planning languages such as PDDL.

Contribution

The paper presents Planetarium, a new benchmark that combines a PDDL equivalence algorithm with a large dataset to rigorously evaluate models that translate natural language into PDDL.

Findings

01  96.1% of the PDDL outputs generated by GPT-4o are syntactically parseable.

02  94.4% are solvable, but only 24.8% are semantically correct.

03  These results highlight the gap between syntactic correctness and semantic accuracy.
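The size of that gap can be worked out directly from the three reported rates. The short Python sketch below does the arithmetic. It assumes the three percentages share one denominator (all GPT-4o outputs) and that the categories are nested, so every semantically correct output is also solvable and parseable; the listing above does not state this, and the variable names are illustrative.

```python
# Rates reported in the findings above, in percent of GPT-4o outputs.
parseable = 96.1
solvable = 94.4
correct = 24.8

# Percentage-point drop at each stage (assumes nested categories).
lost_parse_to_solve = round(parseable - solvable, 1)
lost_solve_to_correct = round(solvable - correct, 1)

# Share of solvable outputs that are also semantically correct.
correct_given_solvable = round(100 * correct / solvable, 1)

print(f"parseable -> solvable: -{lost_parse_to_solve} points")
print(f"solvable  -> correct:  -{lost_solve_to_correct} points")
print(f"correct among solvable: {correct_given_solvable}%")
```

Under those assumptions, almost all of the loss occurs between "solvable" and "semantically correct": roughly one in four solvable outputs describes the intended problem.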

Abstract

Recent works have explored using language models for planning problems. One approach examines translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). Existing evaluation methods struggle to ensure semantic correctness and rely on simple or unrealistic datasets. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. Planetarium features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's…
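To make the idea of checking semantic equivalence rather than textual identity concrete, here is a toy Python sketch. It is not the authors' algorithm: it is a brute-force illustration that treats two problems as equivalent when some renaming of objects maps one problem's initial and goal facts exactly onto the other's. Tolerating object renaming is an assumption made here for illustration, and the dictionary layout and the function names `tagged_facts` and `same_up_to_renaming` are hypothetical.

```python
from itertools import permutations


def tagged_facts(problem):
    """Label each fact with its section so init and goal facts stay distinct."""
    return {(section, *fact)
            for section in ("init", "goal")
            for fact in problem[section]}


def same_up_to_renaming(p, q):
    """True if some bijection between object names maps p's facts onto q's.

    Brute force over all permutations: fine for toy problems only.
    """
    objs_p, objs_q = sorted(p["objects"]), sorted(q["objects"])
    if len(objs_p) != len(objs_q):
        return False
    target = tagged_facts(q)
    for perm in permutations(objs_q):
        rename = dict(zip(objs_p, perm))
        renamed = {(sec, pred, *(rename[a] for a in args))
                   for sec, pred, *args in tagged_facts(p)}
        if renamed == target:
            return True
    return False


# A two-block Blocksworld-style reference problem (illustrative data).
init_ab = {("on-table", "a"), ("on-table", "b"),
           ("clear", "a"), ("clear", "b"), ("arm-empty",)}
reference = {"objects": {"a", "b"}, "init": init_ab, "goal": {("on", "a", "b")}}

# Same structure with different object names.
renamed = {"objects": {"x", "y"},
           "init": {("on-table", "x"), ("on-table", "y"),
                    ("clear", "x"), ("clear", "y"), ("arm-empty",)},
           "goal": {("on", "x", "y")}}

# Well-formed and trivially solvable, but it asks for a different goal.
wrong_goal = {"objects": {"a", "b"}, "init": init_ab, "goal": {("clear", "a")}}

print(same_up_to_renaming(reference, renamed))     # True
print(same_up_to_renaming(reference, wrong_goal))  # False
```

The `wrong_goal` case mirrors the failure mode the benchmark targets: an output can parse and admit a plan while still describing a different problem than the one requested.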

Code & Models

Repositories

batsresearch/planetarium
Official

Datasets

BatsResearch/planetarium
dataset · 660 downloads

Videos

Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Taxonomy

Topics: Model-Driven Software Engineering Techniques · AI-based Problem Solving and Planning