CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, Mengdi Wang

TL;DR
This paper introduces CubeBench, a benchmark built on the Rubik's Cube that evaluates LLM agents' spatial reasoning and long-horizon planning in physical tasks, and it reveals significant limitations in current models.
Contribution
It presents CubeBench, a novel diagnostic benchmark with a three-tiered framework to assess spatial reasoning, state tracking, and exploration in LLMs for physical tasks.
Findings
Leading LLMs score a 0.00% pass rate on all long-horizon tasks
Identifies key cognitive bottlenecks in spatial reasoning and planning
Provides insights for developing more physically-grounded agents
Abstract
Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in…
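As a rough illustration of what "state tracking with full symbolic information" and "partial observation" can mean for a Rubik's Cube, the following is a minimal sketch and not CubeBench's actual implementation. The state representation, the axis convention (x toward the right face, y up, z toward the front), and the helper names solved_cube, apply_move, and observe_face are all assumptions made for this example.

```python
from itertools import product

# Face name -> (axis index, layer coordinate). Assumed convention for this sketch:
# x points to the Right face, y points Up, z points to the Front.
FACES = {"R": (0, 1), "L": (0, -1), "U": (1, 1), "D": (1, -1), "F": (2, 1), "B": (2, -1)}


def face_normal(face):
    """Unit vector pointing out of the given face."""
    axis, side = FACES[face]
    return tuple(side if i == axis else 0 for i in range(3))


def solved_cube():
    """Map (cubie position, sticker normal) -> colour label for a solved cube."""
    state = {}
    for pos in product((-1, 0, 1), repeat=3):
        for face, (axis, side) in FACES.items():
            if pos[axis] == side:
                state[(pos, face_normal(face))] = face
    return state


def rotate(vec, axis, quarter_turns):
    """Rotate an integer 3-vector in 90-degree steps about a coordinate axis."""
    v = list(vec)
    a, b = (axis + 1) % 3, (axis + 2) % 3
    for _ in range(quarter_turns % 4):
        v[a], v[b] = -v[b], v[a]
    return tuple(v)


def apply_move(state, move):
    """Apply a move such as "R", "R'", or "R2" and return the new state."""
    axis, side = FACES[move[0]]
    base = {"": 1, "2": 2, "'": 3}[move[1:]]
    turns = -side * base  # intended to match clockwise-as-seen-from-the-face
    new_state = {}
    for (pos, normal), colour in state.items():
        if pos[axis] == side:  # only stickers on the turning layer move
            pos, normal = rotate(pos, axis, turns), rotate(normal, axis, turns)
        new_state[(pos, normal)] = colour
    return new_state


def observe_face(state, face):
    """Partial observation: the nine stickers currently showing on one face."""
    target = face_normal(face)
    return {pos: colour for (pos, n), colour in state.items() if n == target}


# Track the state through a 24-move sequence, then look at a single face.
state = solved_cube()
for move in "R U R' U'".split() * 6:
    state = apply_move(state, move)
up_view = observe_face(state, "U")
```

An agent that sees only observe_face output has to combine successive views with its own move history to recover the full state, which is the kind of long-horizon tracking the abstract describes.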
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · AI-based Problem Solving and Planning
