PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

Ziliang Zhao; Zenan Xu; Shuting Wang; Hongjin Qian; Yan Lei; Minda Hu; Zhao Wang; Shihan Dou; Zhicheng Dou; Pluto Zhou

arXiv:2605.20873·cs.AI·May 21, 2026


TL;DR

PlanningBench is a framework that generates scalable, diverse, and verifiable planning data from real scenarios, enabling better evaluation and training of large language models in complex planning tasks.

Contribution

PlanningBench introduces a structured taxonomy and a constraint-driven synthesis pipeline for generating controllable, realistic planning data, which makes the data easier to scale and to verify.

Findings

01 Current models struggle with coupled constraints in planning.

02 Reinforcement learning on PlanningBench data improves model performance.

03 Well-specified solutions lead to more stable training dynamics.

Abstract

Planning is a fundamental capability for large language models (LLMs) because complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and…
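To make the idea of "verifiable" planning data concrete, the following is a minimal Python sketch of an automatic checker for a toy scheduling instance. It is an illustration only and does not reflect PlanningBench's actual data format or verifier: the `Task` class, the `verify` function, the task names, and the time-budget constraint are all hypothetical. The sketch shows how coupled constraints (coverage, precedence, and a shared time budget) can be checked mechanically against a proposed plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not the PlanningBench schema."""
    name: str
    duration: int        # hours the task takes
    after: tuple = ()    # names of tasks that must finish first


def verify(tasks, plan, budget_hours):
    """Check a plan (task name -> start hour) against three constraint families.

    Returns a list of violation messages; an empty list means the plan is valid.
    """
    by_name = {t.name: t for t in tasks}
    errors = []

    # Coverage: every task must be scheduled, and nothing unknown may appear.
    for name in by_name:
        if name not in plan:
            errors.append(f"missing task: {name}")
    for name in plan:
        if name not in by_name:
            errors.append(f"unknown task: {name}")

    # Precedence: a task may start only after each prerequisite has finished.
    for name, start in plan.items():
        task = by_name.get(name)
        if task is None:
            continue
        for dep in task.after:
            if dep in plan and dep in by_name:
                if start < plan[dep] + by_name[dep].duration:
                    errors.append(f"{name} starts before {dep} finishes")

    # Resource: the whole plan must finish within the time budget.
    ends = [plan[n] + by_name[n].duration for n in plan if n in by_name]
    if ends and max(ends) > budget_hours:
        errors.append("exceeds time budget")

    return errors


# Illustrative instance (assumed, not taken from the dataset).
tasks = [
    Task("book_flight", 1),
    Task("book_hotel", 1, after=("book_flight",)),
    Task("plan_tour", 2, after=("book_hotel",)),
]
good_plan = {"book_flight": 0, "book_hotel": 1, "plan_tour": 2}
bad_plan = {"book_flight": 0, "book_hotel": 0, "plan_tour": 1}

print(verify(tasks, good_plan, budget_hours=4))
print(verify(tasks, bad_plan, budget_hours=4))
```

Because the checker returns the specific violated constraints rather than a single pass/fail flag, the same function could in principle serve both as an evaluation metric and as a reward signal during training, though how PlanningBench itself does this is not specified in the text above.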


Code & Models

Datasets

tencent/PlanningBench
dataset · 3 dl
