TL;DR
VisioMath is a benchmark that tests whether large multimodal models (LMMs) can reason over visually similar diagrams in mathematics. Accuracy drops as the candidate diagrams become more alike, and alignment-focused strategies recover part of that loss.
Contribution
The paper presents VisioMath, a curated benchmark of 1,800 K-12 math problems in which every candidate answer is a diagram that differs only subtly from the others, and analyzes LMMs' performance and failure modes on this task.
Findings
Models' accuracy declines with increased image similarity.
Image-text misalignment causes systematic errors.
Alignment strategies improve model performance.
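The first finding relates accuracy to how similar a question's option diagrams are. A minimal sketch of how such an analysis might be set up follows. It assumes each option image has already been embedded as a vector by some multimodal embedding model. The function names, the use of mean pairwise cosine similarity, and the bin edges are all illustrative assumptions, not the paper's confirmed procedure.

```python
import math
from bisect import bisect_right
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all pairs of option embeddings.

    Hypothetical per-question similarity score; the paper may define
    its score differently.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def accuracy_by_similarity(records, edges):
    """Group (similarity, is_correct) records into bins and return accuracy per bin.

    `edges` are ascending interior boundaries (an assumption for this sketch):
    [0.5, 0.8] yields bin 0 for s < 0.5, bin 1 for 0.5 <= s < 0.8, bin 2 otherwise.
    Empty bins are omitted from the result.
    """
    totals = {}
    for similarity, is_correct in records:
        index = bisect_right(edges, similarity)
        hits, count = totals.get(index, (0, 0))
        totals[index] = (hits + int(is_correct), count + 1)
    return {index: hits / count for index, (hits, count) in sorted(totals.items())}
```

With real data, `records` would hold one entry per question and model, and a downward trend in accuracy from the lowest bin to the highest would correspond to the decline described above.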
Abstract
Large Multimodal Models have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify correct solutions. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K-12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity…
Peer Reviews
Decision: ICLR 2026 Poster
1. **Novel Benchmark Design for Visual Reasoning in Math** The benchmark uniquely incorporates **both image-based question stems and diagrammatic answer options**, a setting that is rarely seen in prior math benchmarks. This integration of text and diagrams offers a more realistic and cognitively demanding setup, and I believe such *interleaved visual–textual formats* represent an important future direction for multimodal reasoning benchmarks.
2. **High-Quality, Real-World Data Source
1. **Restricted Evaluation Format (Multiple-Choice Only)** Every question in VisioMath follows a **four-choice multiple-choice format**, which simplifies the reasoning process and limits evaluation to **discrete answer selection**. This design makes it difficult to assess open-ended or step-by-step reasoning ability, which is essential for understanding the full reasoning depth of large multimodal models.
2. **Lack of Failure Case Analysis Beyond Accuracy** As a benchmark paper,
* Originality and Significance: The paper identifies and rigorously tests a novel, practical, and highly significant problem. The task of "figure-based option" reasoning, especially with high-similarity distractors, is a ubiquitous real-world scenario (especially in STEM education) that has been almost entirely overlooked by existing benchmarks.
* Benchmark Quality: The VisioMath benchmark is meticulously constructed and serves as an excellent diagnostic tool, not just a leaderboard. The curatio
* Apparent Contradiction in Concatenation Strategy: There is a confusing contradiction in the results. The baseline evaluation (Table 2) shows that single-image LMMs, which use a "composite image concatenation strategy," perform at random-guess levels (e.g., LLaVA-v1.6 at 24.4%). However, "Strategy 1," which also uses a "consolidated single image layout" (concatenation), *improves* the performance of multi-image LMMs (e.g., +6.4% for Seed1.6-Thinking). The paper does not adequately explain why c
- The paper identifies and addresses a critical problem in the field of LMM evaluation by focusing on visually similar diagrammatic options.
- Through the quantification of visual similarity and an innovative "option shuffling" experiment, it uncovers that the core weakness of current LMMs lies in precise image-text alignment.
- It proposes three concrete and viable performance enhancement strategies ranging from training-free methods to lightweight fine-tuning and experimentally demonstrates
- The definition of visual similarity relies entirely on a single model (Qwen multimodal-embedding-v1). Different models' visual encoders may have varying interpretations of "similarity." A brief discussion or experiment demonstrating the correlation or discrepancy with other metrics (such as CLIP embeddings or DINO scores), or one that further justifies this choice, would make this core metric more robust.
- The fine-tuning experiment for Strategy 3 was conducted on only a single model (Qwen2.
