MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes
Nilay Pande, Sahiti Yerramilli, Jayant Sravan Tamarapalli, Rynaa Grover

TL;DR
MaRVL-QA is a new benchmark designed to evaluate mathematical and spatial reasoning in multimodal models using surface plots, revealing current models' limitations and guiding future improvements.
Contribution
Introduces MaRVL-QA, a benchmark with novel tasks for assessing deep reasoning over visual mathematical landscapes in multimodal models.
Findings
State-of-the-art models perform poorly on the benchmark.
Models tend to rely on superficial heuristics rather than true reasoning.
MaRVL-QA exposes significant gaps in current model capabilities.
Abstract
A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features like local maxima; and Transformation Recognition, recognizing applied geometric transformations. Generated from a curated library of functions with rigorous ambiguity filtering, our evaluation on MaRVL-QA reveals that even…
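To make the Topological Counting task concrete, here is a minimal sketch of how a ground-truth count of local maxima could be computed for a function sampled on a grid. It is an illustration only, not the authors' pipeline: the function name `count_local_maxima`, the 8-neighbour test, and the example surface sin(x)·sin(y) are all assumptions, and the prominence-based validation mentioned in the reviews is not modelled.

```python
import numpy as np


def count_local_maxima(z: np.ndarray) -> int:
    """Count interior grid points strictly greater than all 8 neighbours.

    Hypothetical helper for illustration; boundary points are ignored.
    """
    rows, cols = z.shape
    center = z[1:-1, 1:-1]
    is_peak = np.ones(center.shape, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # Shifted view of the grid aligned with the interior points.
            neighbour = z[1 + dy : rows - 1 + dy, 1 + dx : cols - 1 + dx]
            is_peak &= center > neighbour
    return int(is_peak.sum())


# Example surface (an assumption, not taken from the benchmark):
# sin(x) * sin(y) on [0, 2*pi]^2 has interior peaks at
# (pi/2, pi/2) and (3*pi/2, 3*pi/2).
x = np.linspace(0.0, 2.0 * np.pi, 201)
X, Y = np.meshgrid(x, x)
example_count = count_local_maxima(np.sin(X) * np.sin(Y))
print(example_count)
```

A strict comparison is used so that flat plateaus are not counted as peaks; a production pipeline would likely also need a prominence or height threshold to reject numerically insignificant bumps.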
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
1. Great idea overall.
2. Clear task definitions.
3. Good analysis of model behavior: The paper doesn’t just report accuracy; it actually looks at how models fail. For example, some models always guess the same rotation or default to “No Change” when uncertain. Those observations clearly show where reasoning breaks down and make the results more insightful.
4. Potential for extensibility: The idea (function-plot generator) could produce many more different controlled tasks.
I want to call this section "Potential Improvements" rather than weaknesses, and this is where I want to raise most of the points I think need to be discussed.
### **1) Dataset maturity and consistency:**
The sample provided with the paper includes 100 examples per task. “TopologicalCounting” files include labels in their names (e.g., example_001_inferno_heatmap), while “TransformationRecognition” uses random UUIDs. In addition, “TopologicalCounting” comes with a detailed config.json file w
* The benchmark's premise is highly original and insightful. While many benchmarks test math reasoning (e.g., GSM8K) or chart QA (e.g., ChartQA), MaRVL-QA is the first I have seen to so effectively isolate foundational geometric and topological reasoning from high-level semantics. Using semantically-sparse "visual landscapes" as a diagnostic tool is a novel and powerful idea.
* The methodological quality of the benchmark's construction is a significant strength. The decision to hold axis labels
* The evaluation includes a strong set of SOTA closed-source models but is very weak on the open-source side. It primarily features the LLaVA family (which are now several years old and known to be poor at this) and one Qwen model. The LLaVA models perform at or near random chance, offering little insight beyond "they can't do this at all." Including a wider range of modern, capable open-source MLLMs (e.g., InternVL, newer Llama-V) would be necessary to claim these failures are universal and not
The paper was clear and easy to follow. Its main strengths are:
1) Failure mode analysis: The study dissects failure modes (catastrophic, near-miss, heuristic collapse) and bias profiles (maxima and minima salience, rotation, translation confusion).
2) Ambiguity Filtering: The paper employs algorithmic ambiguity filtering based on normalised RMSE thresholds, explicit rejection of symmetric or visually confounding transformations, and prominence-based feature va
I have one main weakness to point out, which need not be addressed immediately: 1) Synthetic Task Design: The benchmark relies on programmatically designed mathematical surface plots, which cover most aspects of visual reasoning but lack the noise and complexity of real-world visual reasoning.
- MaRVL-QA provides a well-defined, controllable testbed for visual–mathematical reasoning over function plots.
- Two-way ambiguity filtering (e.g., excluding symmetric cases; distinguishing rotations vs. translations) and manual review improve label reliability.
- Confidence intervals are provided; format robustness is probed by comparing an LLM parser to a rule-based extractor.
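The ambiguity filtering praised in these reviews (rejecting symmetric cases via normalised RMSE thresholds) can be sketched roughly as follows. This is a hedged illustration, not the paper's implementation: the helper names `nrmse` and `is_ambiguous`, the normalisation by value range, and the threshold of 0.05 are assumptions chosen for the example.

```python
import numpy as np


def nrmse(a: np.ndarray, b: np.ndarray) -> float:
    """RMSE between two surfaces, normalised by the value range of `a`.

    Range normalisation is an assumption; other conventions exist.
    """
    value_range = float(a.max() - a.min())
    if value_range == 0.0:
        # A constant surface: identical arrays are indistinguishable.
        return 0.0 if np.array_equal(a, b) else float("inf")
    return float(np.sqrt(np.mean((a - b) ** 2)) / value_range)


def is_ambiguous(z: np.ndarray, transform, threshold: float = 0.05) -> bool:
    """Flag a sample whose transformed surface is nearly identical to the original.

    `threshold` is a hypothetical value, not one reported by the authors.
    """
    return nrmse(z, transform(z)) < threshold


grid = np.linspace(-2.0, 2.0, 101)
X, Y = np.meshgrid(grid, grid)

# A centred Gaussian is rotationally symmetric, so a 90-degree rotation
# leaves it essentially unchanged and the sample should be rejected.
symmetric = np.exp(-(X ** 2 + Y ** 2))
# Shifting the peak off-centre breaks the symmetry.
off_center = np.exp(-((X - 1.0) ** 2 + Y ** 2))

symmetric_rejected = is_ambiguous(symmetric, np.rot90)
off_center_rejected = is_ambiguous(off_center, np.rot90)
print(symmetric_rejected, off_center_rejected)
```

Passing the transformation as a callable keeps the same check usable for flips or translations, provided the transformed array has the same shape as the original.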
- Synthetic, stylized plots may diverge from real scientific/engineering figures (e.g., noisy measurement fields, varied projections, non-uniform grids). The generated plots are mostly drawn from a single distribution, which limits the representativeness of MaRVL-QA.
- The benchmark is limited to a narrow set of skills, which might overfit models toward a few primitives.
- The correlation analysis in Section 4.4 needs more explanation.
- Some important and recent models are missing from the baseli
