Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

Dhruv Anand; Ehsan Shareghi

arXiv:2512.20595 · cs.CL · December 24, 2025

Open Access

TL;DR

Cube Bench is a benchmark that evaluates the spatial and sequential reasoning of multimodal large language models (MLLMs) on Rubik's cube tasks. Accuracy drops sharply as scramble depth increases.

Contribution

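The paper introduces Cube Bench, a standardized benchmark that decomposes Rubik's-cube reasoning in MLLMs into five skills and compares seven models on a shared set of scrambled cube states, with identical prompts and parsers and a single distance-to-solved metric.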

Findings

01. Accuracy declines sharply with cube scramble depth.

02. Closed-source models outperform open-source counterparts.

03. Self-correction offers modest improvements but can cause overthinking.

Abstract

We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced…
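The depth-wise comparison described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names `random_scramble` and `accuracy_by_depth` are hypothetical, the sketch assumes that "scramble depth" means the number of face turns applied to a solved cube, and the records at the bottom are made-up toy values, not results from the paper.

```python
import random
from collections import defaultdict

FACES = "UDLRFB"            # standard face letters in Rubik's cube notation
SUFFIXES = ["", "'", "2"]   # clockwise, counter-clockwise, half turn


def random_scramble(depth, rng):
    """Return `depth` face turns, never turning the same face twice in a row.

    Assumption: depth is the scramble length. A scramble of d turns leaves
    the cube at most d turns from solved, not necessarily exactly d.
    """
    moves = []
    last_face = None
    for _ in range(depth):
        face = rng.choice([f for f in FACES if f != last_face])
        moves.append(face + rng.choice(SUFFIXES))
        last_face = face
    return moves


def accuracy_by_depth(records):
    """Map (depth, correct) pairs to {depth: fraction correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for depth, correct in records:
        totals[depth] += 1
        hits[depth] += int(correct)
    return {d: hits[d] / totals[d] for d in sorted(totals)}


# Toy usage with invented outcomes, for illustration only.
rng = random.Random(0)
print(random_scramble(5, rng))
toy_records = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
print(accuracy_by_depth(toy_records))  # {1: 1.0, 2: 0.5, 3: 0.0}
```

Bucketing outcomes by depth in this way is what lets different models be compared side by side on the same scrambled states. The real benchmark additionally scores progress with its distance-to-solved metric, which this sketch does not attempt to reproduce.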

Taxonomy

Topics: Multimodal Machine Learning Applications · Face recognition and analysis · Constraint Satisfaction and Optimization