Detecting and Characterizing Planning in Language Models
Jatin Nainani, Sankaran Vaidyanathan, Connor Watts, Andre N. Assis, Alice Rigg

TL;DR
This paper develops a formal framework and a semi-automated annotation pipeline for detecting planning in large language models. It finds that planning is not universal across models and tasks, and that instruction tuning refines planning behaviors that already exist.
Contribution
It introduces causally grounded criteria and a semi-automated pipeline for identifying planning in LLMs, and applies them to multiple models and tasks.
Findings
Planning is not universal across models and tasks.
Instruction tuning refines existing planning behaviors.
Models switch between planning and improvisation dynamically.
Abstract
Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi-step reasoning tasks. Recent work suggests that LLMs may perform planning - selecting a future target token in advance and generating intermediate tokens that lead towards it - rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across models and tasks, we present formal and causally grounded criteria for detecting planning and operationalize them as a semi-automated annotation pipeline. We apply this pipeline to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark and a poem generation task where Claude 3.5 Haiku was previously shown to plan. Our findings show that planning is not universal:…
