TCP: a Benchmark for Temporal Constraint-Based Planning

Zifeng Ding; Sikuan Yan; Zhangdie Yuan; Xianglong Hu; Fangru Lin; Andreas Vlachos

arXiv:2505.19927 · cs.AI · October 14, 2025



TL;DR

The paper introduces TCP, a benchmark that evaluates large language models' ability to jointly perform temporal reasoning and planning within naturalistic dialogue scenarios, and shows that state-of-the-art models struggle with it.

Contribution

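It presents a novel benchmark that combines temporal reasoning and planning in dialogue form. Instances are built from abstract problem prototypes paired with realistic scenarios and enriched into dialogues using an LLM, and a sampled subset is checked by humans.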

Findings

01 State-of-the-art LLMs struggle with TCP tasks.

02 TCP reveals limitations in LLMs' temporal planning abilities.

03 The benchmark is open source for future research.

Abstract

Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark that jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we generate abstract problem prototypes that are then paired with realistic scenarios from various domains and enriched into dialogues using an LLM. A human quality check is performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the…
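To make the task concrete, the following is a minimal sketch of the kind of scheduling problem the abstract describes, not the benchmark's own format or solver. The task names, durations, and constraint types (precedence, earliest start, deadline) are hypothetical, and the function `earliest_schedule` is an illustrative helper. It assumes tasks may run in parallel and that "optimal" means the shortest overall completion time.

```python
from graphlib import TopologicalSorter


def earliest_schedule(durations, after, release=None, deadline=None):
    """Return (start_times, makespan) for an earliest-start schedule.

    durations: task -> length in days
    after:     task -> tasks that must finish before it starts
    release:   task -> earliest allowed start day (optional)
    deadline:  task -> latest allowed finish day (optional)
    """
    release = release or {}
    deadline = deadline or {}
    # graphlib expects a mapping of node -> set of predecessors.
    graph = {task: set(after.get(task, ())) for task in durations}
    start = {}
    # static_order() yields predecessors first and raises CycleError
    # if the precedence constraints are circular.
    for task in TopologicalSorter(graph).static_order():
        finished = [start[p] + durations[p] for p in graph[task]]
        start[task] = max([release.get(task, 0)] + finished)
        end = start[task] + durations[task]
        if task in deadline and end > deadline[task]:
            raise ValueError(f"{task} ends on day {end}, after its deadline")
    makespan = max(start[t] + durations[t] for t in durations)
    return start, makespan


# Hypothetical project: all names and numbers are made up for illustration.
durations = {"design": 3, "build": 5, "test": 2, "docs": 2}
after = {"build": ["design"], "test": ["build"], "docs": ["design"]}
release = {"docs": 4}    # "docs cannot begin before day 4"
deadline = {"test": 10}  # "testing must be finished by day 10"

start, makespan = earliest_schedule(durations, after, release, deadline)
```

Starting every task as early as its constraints allow minimizes the overall completion time when only precedence and earliest-start constraints are present, and any deadline that this schedule misses cannot be met by a later one. In TCP the harder part comes before this step: the constraints have to be inferred from dialogue, where they may be stated only implicitly.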


Code & Models

Datasets

Beanbagdzf/TCP
dataset · 30 downloads


Taxonomy

Topics: Formal Methods in Verification · Logic, programming, and type systems · Advanced Database Systems and Queries