MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

Yulun Jiang; Yekun Chai; Maria Brbić; Michael Moor

arXiv:2506.22992 · cs.AI · July 1, 2025

TL;DR

MARBLE is a challenging multimodal reasoning benchmark designed to evaluate the step-by-step reasoning and planning abilities of multimodal language models in complex spatial, visual, and physical environments, revealing significant performance gaps.

Contribution

The paper introduces MARBLE, a novel benchmark with two complex tasks, to assess and highlight the limitations of current multimodal language models in reasoning and planning.

Findings

01. Current MLLMs perform poorly on MARBLE tasks.

02. Models achieve near-random performance on complex tasks.

03. Perception remains a bottleneck in multimodal reasoning.

Abstract

The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE --…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

Clear and well-structured writing. The manuscript is clearly written and easy to follow. The benchmark design and evaluation methodology are all presented in a logical, concise manner, supported by informative figures and examples.

Thoughtful dataset and task construction. The benchmark curation process is technically sound. The detailed pipeline for constructing environments and task sequences provides valuable insight for future multimodal benchmark development.

Comprehensive evaluation and

Weaknesses

Unclear motivation behind task combination. While each task individually provides meaningful evaluation, the rationale for combining M-Portal, M-Cube, and M-Maze into a single benchmark is not fully articulated. These settings are fairly distinct in format and objective, and similar concepts have appeared in prior embodied and spatial reasoning benchmarks. The paper would benefit from a stronger justification for why combining these tasks yields emergent value beyond scaling and aggregation. Ad

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is well written and easy to follow.
2. The three tasks probe different mixtures of perception, spatial reasoning, combinatorics, and rule-driven planning. The two-tier difficulty in each task (e.g., CUBE vs CUBE-easy, MAZE vs MAZE-easy) cleanly exposes where models fail (perception vs search vs dynamics).
3. The paper isolates perception with a conversion task (image to edge arrays), showing around 70–76% per-cell accuracy and 0% piece-level accuracy across models, which plausibly e
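The gap between per-cell and piece-level accuracy in point 3 can be illustrated with a short calculation. The following is a minimal sketch, not the paper's evaluation code. It assumes that cell errors are independent and that a piece has 25 cells. Neither assumption comes from the paper or the review, and `piece_accuracy` and `CELLS_PER_PIECE` are hypothetical names introduced here.

```python
def piece_accuracy(per_cell_accuracy: float, cells_per_piece: int) -> float:
    """Probability that every cell of one piece is correct.

    Assumes each cell is predicted independently with the same accuracy,
    which is an illustrative simplification, not a claim about the models.
    """
    if not 0.0 <= per_cell_accuracy <= 1.0:
        raise ValueError("per_cell_accuracy must lie in [0, 1]")
    if cells_per_piece < 1:
        raise ValueError("cells_per_piece must be at least 1")
    return per_cell_accuracy ** cells_per_piece


# Hypothetical piece size; the review does not state the real value.
CELLS_PER_PIECE = 25

# The review reports roughly 70-76% per-cell accuracy.
for p in (0.70, 0.76):
    print(f"per-cell {p:.2f} -> piece-level {piece_accuracy(p, CELLS_PER_PIECE):.4%}")
```

Under these assumptions both values come out well below 1%, which is in line with the near-zero piece-level accuracy the reviewer cites: moderate per-cell error compounds quickly when a whole piece must be exactly right.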

Weaknesses

1. Plan-correctness relies on mixing up to five independent mistake steps to produce 2^5 candidates with 1 positive, which is an extreme imbalance that can confound minority-class F1 and encourage shortcut cues in negatives.
2. Human results are reported from 2–3 experienced players; this small N, without variance/confidence intervals, makes it hard to contextualize model gaps especially on hard tasks.
4. While the image to array conversion task is informative, it’s still a single proxy. Conside
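The imbalance described in weakness 1 can be made concrete with a small calculation. This is a minimal sketch of the reviewer's arithmetic, not the benchmark's scoring code. It assumes exactly five on/off mistake steps (the review says "up to five") and that the only correct plan is the one with no mistakes. `NUM_MISTAKE_STEPS` and `f1_score` are names introduced here for illustration.

```python
from itertools import product

# Assumption: exactly five independent mistake steps, each either applied or not.
NUM_MISTAKE_STEPS = 5

# Every on/off combination of mistake steps yields one candidate plan: 2^5 = 32.
candidates = list(product([False, True], repeat=NUM_MISTAKE_STEPS))

# A candidate is a correct plan (positive) only if no mistake step was applied.
labels = [not any(mask) for mask in candidates]


def f1_score(labels, predictions):
    """F1 for the positive class, computed from true/false positives and negatives."""
    tp = sum(l and p for l, p in zip(labels, predictions))
    fp = sum((not l) and p for l, p in zip(labels, predictions))
    fn = sum(l and (not p) for l, p in zip(labels, predictions))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two trivial baselines: call every plan correct, or call every plan incorrect.
f1_always_correct = f1_score(labels, [True] * len(labels))
f1_always_incorrect = f1_score(labels, [False] * len(labels))

print(len(candidates), sum(labels), f1_always_correct, f1_always_incorrect)
```

With 1 positive among 32 candidates, a classifier that labels every plan correct scores F1 = 2/33 (about 0.06) on the positive class, while one that labels every plan incorrect scores 0 despite being right on 31 of 32 candidates. Small changes in predictions on the single positive therefore dominate the metric.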

Reviewer 03 · Rating 2 · Confidence 4

Strengths

This benchmark is distinctive in its design, incorporating video game environments to construct datasets that evaluate the multimodal reasoning abilities of MLLMs. It features long-horizon tasks with large search spaces, providing a challenging testbed. The benchmark provides a comprehensive evaluation of various MLLMs, revealing that visual perception and planning remain critical bottlenecks in multimodal reasoning.

Weaknesses

The main focus of this work is on providing a benchmark to evaluate the capabilities of existing MLLMs. It feels largely engineering-oriented, emphasizing data curation rather than introducing new methods to enhance MLLM performance, which limits the conceptual contribution of the paper.

While the overall writing quality is good, it could be made more concise.

The evaluation setup appears somewhat specialized, focusing on a narrow subset of tasks based on simulated images, which may limit the

Code & Models

Datasets

mrble/MARBLE
dataset · 524 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation