CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations

Huan-ang Gao; Zikang Zhang; Tianwei Luo; Kaisen Yang; Xinzhe Juan; Jiahao Qiu; Tianxing Chen; Bingxiang He; Hao Zhao; Hao Zhou; Shilong Liu; Mengdi Wang

arXiv:2512.23328·cs.AI·January 5, 2026


Open Access

TL;DR

This paper introduces CubeBench, a Rubik's Cube benchmark that evaluates the spatial reasoning and long-horizon planning LLM agents need for physical tasks, and reveals significant limitations in current models.

Contribution

The paper presents CubeBench, a generative diagnostic benchmark whose three-tiered framework assesses spatial reasoning, long-horizon state tracking, and active exploration under partial observation in LLM agents.

Findings

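01 Leading LLMs score a uniform 0.00% pass rate on all long-horizon tasks.

02 The paper identifies three cognitive bottlenecks: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation.

03 The results offer insights for developing more physically grounded agents.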

Abstract

Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in…
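To make "state tracking with full symbolic information" and "partial observation" concrete, here is a minimal Python sketch of a symbolic Rubik's Cube. It is not CubeBench's implementation: the sticker encoding, the axis convention, and the names `solved`, `turn`, `apply_moves`, and `observe` are all assumptions made for illustration, and the three-face view in `observe` is a hypothetical stand-in for whatever partial views the benchmark actually provides.

```python
from itertools import product

# Outward normal of each face. Axis convention (x right, y up, z toward the
# viewer) is an assumption of this sketch, not taken from the paper.
FACES = {"U": (0, 1, 0), "D": (0, -1, 0), "R": (1, 0, 0),
         "L": (-1, 0, 0), "F": (0, 0, 1), "B": (0, 0, -1)}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rot(v, n):
    """Rotate v a quarter turn about axis n, clockwise seen from outside."""
    d, c = dot(n, v), cross(n, v)
    return tuple(ni * d - ci for ni, ci in zip(n, c))

def solved():
    """Map (cubie position, sticker normal) -> colour, named by home face."""
    state = {}
    for pos in product((-1, 0, 1), repeat=3):
        for face, n in FACES.items():
            if dot(pos, n) == 1:  # the cubie lies on this outer layer
                state[(pos, n)] = face
    return state

def turn(state, move):
    """Apply one move such as "R", "R'" or "R2" and return the new state."""
    n = FACES[move[0]]
    quarter_turns = {"": 1, "2": 2, "'": 3}[move[1:]]
    for _ in range(quarter_turns):
        new = {}
        for (pos, normal), colour in state.items():
            if dot(pos, n) == 1:  # sticker belongs to the turning layer
                new[(rot(pos, n), rot(normal, n))] = colour
            else:
                new[(pos, normal)] = colour
        state = new
    return state

def apply_moves(state, moves):
    """Track the state through a whitespace-separated move sequence."""
    for move in moves.split():
        state = turn(state, move)
    return state

def observe(state, visible=("U", "F", "R")):
    """Hypothetical partial view: only stickers on the visible faces."""
    normals = {FACES[f] for f in visible}
    return {key: colour for key, colour in state.items() if key[1] in normals}

scrambled = apply_moves(solved(), "R U2 F'")
print(len(observe(scrambled)), "of", len(scrambled), "stickers visible")
```

An agent with full symbolic information would receive the whole `state` dictionary, while one under partial observation would receive only `observe(state)` and would need to reorient the cube or act to learn the rest. Long-horizon state tracking then amounts to predicting the result of `apply_moves` over many moves without access to a simulator like this one.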


Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · AI-based Problem Solving and Planning