MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes

Nilay Pande; Sahiti Yerramilli; Jayant Sravan Tamarapalli; Rynaa Grover

arXiv:2508.17180·cs.AI·September 10, 2025

TL;DR

MaRVL-QA is a new benchmark designed to evaluate mathematical and spatial reasoning in multimodal models using surface plots, revealing current models' limitations and guiding future improvements.

Contribution

Introduces MaRVL-QA, a benchmark with novel tasks for assessing deep reasoning over visual mathematical landscapes in multimodal models.

Findings

1. State-of-the-art models perform poorly on the benchmark.
2. Models tend to rely on superficial heuristics rather than true reasoning.
3. MaRVL-QA exposes significant gaps in current model capabilities.

Abstract

A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features like local maxima; and Transformation Recognition, recognizing applied geometric transformations. Generated from a curated library of functions with rigorous ambiguity filtering, our evaluation on MaRVL-QA reveals that even…
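To make the Topological Counting task concrete, here is a minimal sketch of how a ground-truth count of local maxima could be computed for a sampled surface. It is an illustration only, not the authors' generation pipeline: the function name `count_local_maxima`, the 8-neighbour definition of a peak, and the two-bump example surface are all assumptions.

```python
import numpy as np


def count_local_maxima(z: np.ndarray) -> int:
    """Count grid points strictly greater than all 8 neighbours.

    Hypothetical helper: the benchmark's own feature detection is not
    described in this listing and may differ.
    """
    # Pad with -inf so that boundary points can also qualify as peaks.
    padded = np.pad(z, 1, mode="constant", constant_values=-np.inf)
    centre = padded[1:-1, 1:-1]
    is_peak = np.ones_like(z, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # View of the padded grid shifted by (dy, dx), same shape as z.
            neighbour = padded[
                1 + dy : padded.shape[0] - 1 + dy,
                1 + dx : padded.shape[1] - 1 + dx,
            ]
            is_peak &= centre > neighbour
    return int(is_peak.sum())


# Example surface (an assumption, not taken from the benchmark):
# two well-separated Gaussian bumps centred near (-1, 0) and (1, 0).
xs = np.linspace(-3.0, 3.0, 61)
x, y = np.meshgrid(xs, xs)
z = np.exp(-2 * ((x - 1) ** 2 + y ** 2)) + np.exp(-2 * ((x + 1) ** 2 + y ** 2))
print(count_local_maxima(z))
```

A strict comparison means flat plateaus are never counted, which is one simple way to avoid ambiguous labels; a prominence threshold would be a natural refinement.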

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01Rating 2Confidence 5

Strengths

1. Great idea overall.
2. Clear task definitions.
3. Good analysis of model behavior: The paper doesn’t just report accuracy; it actually looks at how models fail. For example, some models always guess the same rotation or default to “No Change” when uncertain. Those observations clearly show where reasoning breaks down and make the results more insightful.
4. Potential for extensibility: The idea (function-plot generator) could produce many more different controlled tasks.

Weaknesses

I want to call this section "Potential Improvements" rather than weaknesses. This is where I want to raise most of the points I think need to be discussed.

1) Dataset maturity and consistency: The sample provided with the paper includes 100 examples per task. “TopologicalCounting” files include labels in their names (e.g., example_001_inferno_heatmap), while “TransformationRecognition” uses random UUIDs. In addition, “TopologicalCounting” comes with a detailed config.json file w

Reviewer 02Rating 6Confidence 4

Strengths

* The benchmark's premise is highly original and insightful. While many benchmarks test math reasoning (e.g., GSM8K) or chart QA (e.g., ChartQA), MaRVL-QA is the first I have seen to so effectively isolate foundational geometric and topological reasoning from high-level semantics. Using semantically-sparse "visual landscapes" as a diagnostic tool is a novel and powerful idea.
* The methodological quality of the benchmark's construction is a significant strength. The decision to hold axis labels

Weaknesses

* The evaluation includes a strong set of SOTA closed-source models but is very weak on the open-source side. It primarily features the LLaVA family (which are now several years old and known to be poor at this) and one Qwen model. The LLaVA models perform at or near random chance, offering little insight beyond "they can't do this at all." Including a wider range of modern, capable open-source MLLMs (e.g., InternVL, newer Llama-V) would be necessary to claim these failures are universal and not

Reviewer 03Rating 8Confidence 4

Strengths

The paper is clear and easy to follow. Its main strengths include:
1) Failure mode analysis: The study dissects failure modes (catastrophic, near-miss, heuristic collapse) and bias profiles (maxima and minima salience, rotation and translation confusion).
2) Ambiguity Filtering: The paper employs algorithmic ambiguity filtering based on normalised RMSE thresholds, explicit rejection of symmetric or visually confounding transformations, and prominence-based feature va

Weaknesses

I have one main weakness to point out, which does not need to be addressed immediately:
1) Synthetic Task Design: The benchmark relies on programmatically designed mathematical surface plots, which cover most aspects of visual reasoning but lack the noise and complexity of real-world visual reasoning.

Reviewer 04Rating 2Confidence 4

Strengths

- MaRVL-QA provides a well-defined, controllable testbed for visual–mathematical reasoning over function plots.
- Two-way ambiguity filtering (e.g., excluding symmetric cases; distinguishing rotations vs. translations) and manual review improve label reliability.
- Confidence intervals are provided; format robustness is probed by comparing an LLM parser to a rule-based extractor.
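The reviews mention ambiguity filtering that excludes symmetric cases using normalised RMSE thresholds. A minimal sketch of that idea follows, under stated assumptions: the helper names `nrmse` and `is_ambiguous`, the normalisation by value range, and the threshold of 0.05 are hypothetical choices for illustration, not the paper's actual settings.

```python
import numpy as np


def nrmse(a: np.ndarray, b: np.ndarray) -> float:
    """RMSE between two same-shape grids, normalised by the value range of `a`.

    Range normalisation is an assumption; the paper may normalise differently.
    """
    value_range = float(a.max() - a.min())
    if value_range == 0.0:
        # A constant surface looks identical under any transformation.
        return 0.0
    return float(np.sqrt(np.mean((a - b) ** 2)) / value_range)


def is_ambiguous(z: np.ndarray, transform, threshold: float = 0.05) -> bool:
    """Flag a (surface, transform) pair whose output is nearly unchanged.

    `threshold` is a hypothetical value chosen for this example.
    """
    return nrmse(z, transform(z)) < threshold


# Square grid so that a 90-degree rotation keeps the array shape.
xs = np.linspace(-1.0, 1.0, 21)
x, y = np.meshgrid(xs, xs)
radial = np.exp(-(x ** 2 + y ** 2))  # rotationally symmetric bump
ramp = x                             # plane tilted along one axis

print(is_ambiguous(radial, np.rot90), is_ambiguous(ramp, np.rot90))
```

Under this sketch the radially symmetric bump would be rejected for a rotation question, because the rotated plot is indistinguishable from the original, while the tilted plane would be kept.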

Weaknesses

- Synthetic, stylized plots may diverge from real scientific/engineering figures (e.g., noisy measurement fields, varied projections, non-uniform grids). The generated plots are mostly drawn from a single distribution, which limits the representativeness of MaRVL-QA.
- The benchmark is limited to a narrow set of skills, which might over-fit models toward a few primitives.
- The correlation analysis in Section 4.4 needs more explanation.
- Some important and recent models are missing from the baseli

Code & Models

Datasets

sahitiy51/MaRVL-QA
dataset· 87 dl
