Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs
Dhruv Anand, Ehsan Shareghi

TL;DR
Cube Bench is a benchmark that evaluates the spatial and sequential reasoning of multimodal large language models (MLLMs) on Rubik's cube tasks. Accuracy drops sharply as scramble depth increases.
Contribution
The paper introduces Cube Bench, a standardized benchmark that decomposes spatial reasoning in MLLMs into five skills and compares seven models using shared cube states, identical prompts and parsers, and a single distance-to-solved metric.
Findings
Accuracy declines sharply with cube scramble depth.
Closed-source models outperform open-source counterparts.
Self-correction offers modest improvements but can cause overthinking.
Abstract
We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced…
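The evaluation setup described in the abstract (shared scrambled states, results reported as a function of scramble depth) can be illustrated with a minimal sketch. This is not the authors' code: the function names make_scramble and accuracy_by_depth, the move notation, and the reading of "depth" as the number of random face turns are all assumptions made for illustration. The number of random turns is only an upper bound on the true distance to solved, so the benchmark's own definition of depth may differ.

```python
import random
from collections import defaultdict

# Assumed notation: six face letters, each turned clockwise (""),
# counter-clockwise ("'"), or by a half turn ("2").
FACES = "UDLRFB"
SUFFIXES = ["", "'", "2"]


def make_scramble(depth, rng):
    """Return `depth` random face turns, never turning the same face twice in a row."""
    moves = []
    last_face = None
    for _ in range(depth):
        face = rng.choice([f for f in FACES if f != last_face])
        moves.append(face + rng.choice(SUFFIXES))
        last_face = face
    return moves


def accuracy_by_depth(records):
    """Aggregate (depth, correct) pairs into a {depth: accuracy} mapping."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for depth, correct in records:
        totals[depth] += 1
        hits[depth] += int(correct)
    return {d: hits[d] / totals[d] for d in sorted(totals)}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so every model sees the same scrambles
    print(make_scramble(5, rng))

    # Placeholder outcomes for illustration only, not results from the paper.
    toy_records = [(1, True), (1, True), (3, True), (3, False)]
    print(accuracy_by_depth(toy_records))  # {1: 1.0, 3: 0.5}
```

Seeding the generator once and reusing the resulting scrambles for every model mirrors the "shared set of scrambled cube states" idea: differences in per-depth accuracy then reflect the models rather than the sampled states.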
Taxonomy
Topics: Multimodal Machine Learning Applications · Face recognition and analysis · Constraint Satisfaction and Optimization
