MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor

TL;DR
MARBLE is a challenging multimodal reasoning benchmark designed to evaluate the step-by-step reasoning and planning abilities of multimodal language models in complex spatial, visual, and physical environments, revealing significant performance gaps.
Contribution
The paper introduces MARBLE, a novel benchmark with two complex tasks, to assess and highlight the limitations of current multimodal language models in reasoning and planning.
Findings
Current MLLMs perform poorly on MARBLE tasks.
Models achieve near-random performance on complex tasks.
Perception remains a bottleneck in multimodal reasoning.
Abstract
The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE --…
Peer Reviews
Decision · Submitted to ICLR 2026
Clear and well-structured writing. The manuscript is clearly written and easy to follow. The benchmark design and evaluation methodology are all presented in a logical, concise manner, supported by informative figures and examples. Thoughtful dataset and task construction. The benchmark curation process is technically sound. The detailed pipeline for constructing environments and task sequences provides valuable insight for future multimodal benchmark development. Comprehensive evaluation and
Unclear motivation behind task combination. While each task individually provides meaningful evaluation, the rationale for combining M-Portal, M-Cube, and M-Maze into a single benchmark is not fully articulated. These settings are fairly distinct in format and objective, and similar concepts have appeared in prior embodied and spatial reasoning benchmarks. The paper would benefit from a stronger justification for why combining these tasks yields emergent value beyond scaling and aggregation. Ad
1. The paper is well written and easy to follow. 2. The three tasks probe different mixtures of perception, spatial reasoning, combinatorics, and rule-driven planning. The two-tier difficulty in each task (e.g., CUBE vs CUBE-easy, MAZE vs MAZE-easy) cleanly exposes where models fail (perception vs search vs dynamics). 3. The paper isolates perception with a conversion task (image to edge arrays), showing around 70–76% per-cell accuracy and 0% piece-level accuracy across models, which plausibly e
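The gap between roughly 70–76% per-cell accuracy and 0% piece-level accuracy can be illustrated with a short calculation. This is only a sketch under two assumptions that are not stated in the review: cell errors are independent, and a piece has 25 cells (a hypothetical 5x5 edge array). The function name `piece_accuracy` is illustrative.

```python
def piece_accuracy(per_cell_accuracy: float, cells_per_piece: int) -> float:
    """Probability that every cell of one piece is correct,
    assuming each cell is right independently of the others."""
    return per_cell_accuracy ** cells_per_piece


# Hypothetical piece size; the review does not give the array dimensions.
CELLS_PER_PIECE = 25

for p in (0.70, 0.76):
    acc = piece_accuracy(p, CELLS_PER_PIECE)
    print(f"per-cell {p:.2f} -> expected piece-level {acc:.5f}")
```

Under these assumptions the expected piece-level accuracy falls well below 1%, so a reported 0% is consistent with per-cell accuracy in the 70s.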
1. Plan-correctness relies on mixing up to five independent mistake steps to produce 2^5 candidates with 1 positive, which is an extreme imbalance that can confound minority-class F1 and encourage shortcut cues in negatives. 2. Human results are reported from 2–3 experienced players; this small N, without variance/confidence intervals, makes it hard to contextualize model gaps especially on hard tasks. 4. While the image to array conversion task is informative, it’s still a single proxy. Conside
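The imbalance raised in point 1 can be made concrete with a small sketch. It assumes each of five mistake steps is either applied or not, giving 2^5 = 32 candidate plans, and that only the plan with no mistakes is positive. The helper `f1_positive` and the two trivial baselines are illustrative and are not taken from the paper.

```python
from itertools import product


def f1_positive(y_true, y_pred):
    """F1 score of the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


NUM_MISTAKE_STEPS = 5  # "up to five" in the review; the maximum is used here

# Each candidate is a tuple of 0/1 flags: 1 means that mistake step is injected.
candidates = list(product([0, 1], repeat=NUM_MISTAKE_STEPS))
# Assumption: a plan is correct only if no mistake step was injected.
labels = [not any(flags) for flags in candidates]

always_correct = [True] * len(candidates)     # baseline: accept every plan
always_incorrect = [False] * len(candidates)  # baseline: reject every plan

print(len(candidates), sum(labels))
print(f1_positive(labels, always_correct))
print(f1_positive(labels, always_incorrect))
```

With one positive among 32 candidates, accepting everything yields a positive-class F1 of 2/33 and rejecting everything yields 0, so even small differences in how many true positives a model recovers can swing the score.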
This benchmark is distinctive in its design, incorporating video game environments to construct datasets that evaluate the multimodal reasoning abilities of MLLMs. It features long-horizon tasks with large search spaces, providing a challenging testbed. The benchmark provides a comprehensive evaluation of various MLLMs, revealing that visual perception and planning remain critical bottlenecks in multimodal reasoning.
The main focus of this work is on providing a benchmark to evaluate the capabilities of existing MLLMs. It feels largely engineering-oriented, emphasizing data curation rather than introducing new methods to enhance MLLM performance, which limits the conceptual contribution of the paper. While the overall writing quality is good, it could be made more concise. The evaluation setup appears somewhat specialized, focusing on a narrow subset of tasks based on simulated images, which may limit the
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation
