Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game
Michael Katz, Harsha Kokel, Sarath Sreedharan

TL;DR
This paper introduces a new challenging planning benchmark based on the Countdown game, enabling evaluation of planning models with verifiable outcomes and rich instance diversity.
Contribution
The paper proposes a novel Countdown-based benchmark for planning, with a formal analysis of its complexity and evaluation of existing LLM planning methods.
Findings
The Countdown planning problem is NP-complete and computationally challenging.
Existing LLM-based planning methods perform poorly on the Countdown benchmark.
The benchmark offers a rich, verifiable, and natural language-compatible domain for planning evaluation.
Abstract
There is a broad consensus that the inability to form long-term plans is one of the key limitations of current foundational models and agents. However, the existing planning benchmarks remain woefully inadequate to truly measure their planning capabilities. Most existing benchmarks either focus on loosely defined tasks like travel planning or end up leveraging existing domains and problems from international planning competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered around the game called Countdown, where a player is expected to form a target number from a list of input numbers through arithmetic operations. From a world-model perspective, each instance induces a…
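To make the task concrete, here is a minimal Python sketch of a Countdown instance, with a brute-force solver and a plan checker. It is an illustration, not the paper's benchmark generator or evaluation code. The rules are assumed to be those of the standard game: each input number may be used at most once, the operations are +, -, *, and exact division, and every intermediate result must be a positive integer. The paper's exact variant may differ. The names `solve` and `verify` and the step format `(a, op, b, result)` are hypothetical.

```python
from itertools import combinations


def solve(numbers, target):
    """Exhaustive search. Returns a list of steps (a, op, b, result) or None.

    Assumed rules (standard game, may differ from the paper's variant):
    positive integer inputs, each used at most once, exact division only,
    and all intermediate results strictly positive.
    """
    def search(pool, steps):
        if target in pool:
            return steps
        # Pick any two available numbers, combine them, and recurse.
        for i, j in combinations(range(len(pool)), 2):
            a, b = max(pool[i], pool[j]), min(pool[i], pool[j])
            rest = [pool[k] for k in range(len(pool)) if k not in (i, j)]
            candidates = [("+", a + b), ("*", a * b)]
            if a > b:  # keep results strictly positive
                candidates.append(("-", a - b))
            if b != 0 and a % b == 0:  # exact division only
                candidates.append(("/", a // b))
            for op, value in candidates:
                found = search(rest + [value], steps + [(a, op, b, value)])
                if found is not None:
                    return found
        return None

    return search(list(numbers), [])


def verify(numbers, target, steps):
    """Check a plan step by step against the same assumed rules."""
    pool = list(numbers)
    for a, op, b, value in steps:
        # Both operands must currently be available; consume them.
        for x in (a, b):
            if x not in pool:
                return False
            pool.remove(x)
        expected = {
            "+": a + b,
            "-": a - b,
            "*": a * b,
            "/": a // b if b and a % b == 0 else None,
        }.get(op)
        if expected != value or value <= 0:
            return False
        pool.append(value)
    return target in pool


if __name__ == "__main__":
    plan = solve([2, 3, 5, 10], 17)
    print(plan, verify([2, 3, 5, 10], 17, plan))
```

Checking a proposed plan takes time linear in its length, while the search tries every pair of numbers and every operation at each step, so its cost grows rapidly with the number of inputs. This sketch only illustrates that gap between verifying and finding a plan; it makes no claim about the methods evaluated in the paper.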
