Frontier Large Language Models Rival State-of-the-Art Planners
Augusto B. Corrêa, André G. Pereira, Jendrik Seipp

TL;DR
Recent frontier large language models, especially Gemini 3.1 Pro and GPT-5, now solve challenging planning tasks at a level comparable to state-of-the-art classical planners, overturning earlier findings that LLMs cannot plan reliably.
Contribution
This study shows that the latest frontier LLMs can solve challenging planning tasks, contradicting earlier conclusions that LLMs cannot reliably solve even simple ones.
Findings
On standard task descriptions, Gemini 3.1 Pro solves 245 of 360 tasks, ahead of the strongest classical planner baseline (234).
GPT-5 achieves performance comparable to state-of-the-art classical planners.
Performance declines when semantic information is removed, but Gemini 3.1 Pro remains competitive.
Abstract
A series of influential studies established that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition following rigorous evaluation guidelines: solutions are verified with a validation tool, tasks are freshly created to avoid data contamination, and performance is compared against state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro outperforms the strongest planner baseline (245 vs. 234 solved tasks out of 360), while GPT-5 achieves comparable performance to the baselines. When all semantic information is obfuscated from the descriptions to test for pure symbolic planning, performance degrades but Gemini 3.1 Pro…
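To make the obfuscation experiment concrete, the following is a minimal sketch of one way semantic information could be stripped from a PDDL-style description: every meaningful name is consistently replaced with an opaque identifier, so that only the symbolic structure remains. The function name `obfuscate`, the `sym<i>` naming scheme, and the example action are assumptions made for illustration; the abstract does not specify the authors' actual procedure.

```python
import re


def obfuscate(text, symbols):
    """Consistently replace each listed symbol with an opaque name.

    Hypothetical helper for illustration only; not the authors' tool.
    Returns the rewritten text and the mapping that was applied.
    """
    if not symbols:
        return text, {}
    # Sort so the mapping is deterministic across runs.
    mapping = {name: f"sym{i}" for i, name in enumerate(sorted(symbols))}
    # Try longer names first so a name is never shadowed by its own prefix.
    alternatives = "|".join(
        re.escape(s) for s in sorted(symbols, key=len, reverse=True)
    )
    # Match whole tokens only; PDDL names may contain hyphens and underscores.
    pattern = re.compile(r"(?<![\w-])(" + alternatives + r")(?![\w-])")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


action = "(:action pick-up :parameters (?b) :precondition (clear ?b))"
obfuscated, mapping = obfuscate(action, {"pick-up", "clear"})
print(obfuscated)
# (:action sym1 :parameters (?b) :precondition (sym0 ?b))
```

Because the renaming is consistent, a plan found for the obfuscated task can be translated back by inverting `mapping`, which keeps verification with a validation tool possible.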
Taxonomy
Topics: Multimodal Machine Learning Applications · AI-based Problem Solving and Planning · Topic Modeling
