When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye

TL;DR
MIRA is a benchmark that evaluates models' ability to generate and utilize intermediate visual images for reasoning, highlighting the importance of visual cues in complex problem-solving tasks.
Contribution
This paper introduces MIRA, a multimodal benchmark with a new evaluation protocol emphasizing intermediate visual reasoning, and demonstrates the limitations of current models relying solely on text.
Findings
When intermediate visual cues are provided, model performance improves consistently, with an average relative gain of 33.7%.
Existing models perform poorly with text-only prompts on visual reasoning tasks.
Textual prompts aligned with Visual-CoT provide only limited improvements compared with the Visual-CoT setting.
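A minimal sketch of how an averaged gain like the 33.7% figure above could be computed, assuming it is a relative gain (improvement divided by the text-only baseline) averaged over model/task pairs. The function names and the accuracy values are hypothetical illustrations, not numbers from the paper.

```python
from typing import Iterable, Tuple


def relative_gain(baseline_acc: float, improved_acc: float) -> float:
    """Relative gain of improved_acc over baseline_acc, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (improved_acc - baseline_acc) / baseline_acc


def mean_relative_gain(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the per-pair relative gains (one pair per model or task)."""
    gains = [relative_gain(base, improved) for base, improved in pairs]
    return sum(gains) / len(gains)


# Hypothetical (text-only, with-visual-cues) accuracies in percent.
# These are NOT results reported for MIRA.
hypothetical_pairs = [(10.0, 14.0), (20.0, 25.0)]
print(mean_relative_gain(hypothetical_pairs))  # 32.5
```

Note that a relative gain can look large when the baseline is low: moving from 10% to 14% accuracy is a 40% relative gain but only 4 percentage points in absolute terms.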
Abstract
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three…
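The abstract is cut off before it lists the protocol's levels, but the reviews name them as Direct Evaluation, Text-CoT, and Simulated Visual-CoT. The sketch below shows one way such a three-level comparison could be wired up. It is an illustration under stated assumptions, not the authors' harness: the `Problem` fields, the instruction strings, the exact-match scoring, and the `model` callable are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Level names follow the reviews; the identifiers themselves are assumptions.
LEVELS = ("direct", "text_cot", "visual_cot")


@dataclass
class Problem:
    question: str
    image: str  # path to the input image (hypothetical field)
    intermediate_images: List[str] = field(default_factory=list)
    answer: str = ""


def build_inputs(problem: Problem, level: str) -> Dict:
    """Assemble the prompt and image list for one evaluation level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    images = [problem.image]
    if level == "direct":
        instruction = "Answer directly."
    elif level == "text_cot":
        instruction = "Reason step by step in text, then answer."
    else:
        # Simulated Visual-CoT: the annotated intermediate images are supplied.
        instruction = "Use the provided intermediate images to reason, then answer."
        images = images + problem.intermediate_images
    return {"prompt": f"{problem.question}\n{instruction}", "images": images}


def accuracy(problems: List[Problem], model: Callable[[Dict], str], level: str) -> float:
    """Exact-match accuracy of `model` at the given level (assumed metric)."""
    correct = sum(model(build_inputs(p, level)).strip() == p.answer for p in problems)
    return correct / len(problems)
```

Running `accuracy` once per entry in `LEVELS` with the same `model` gives the per-level comparison that the reviewers describe.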
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper successfully identifies shortcomings in existing datasets and makes a convincing argument for why the benchmark is needed.
- The pipelines for data generation and evaluation are set up well, providing some degree of trust in the dataset.
- The results on the benchmark suggest new directions for the field to explore, highlighting types of questions that remain unsolved even by strong closed models.
- Tool-augmented methods are mentioned as part of the motivation (page 2) but, as far as I can tell, are not evaluated on the benchmark. It would be interesting to see these results.
- Section 4.2 suggests Visual-CoT data may be a way forward for these types of questions, but, if I understand correctly, these need to be created manually. In what contexts would they actually be helpful, given that new problems won't have them at inference time or will need to be annotated for that specifically? When is it easier for
1. The paper proposes an interesting scenario that requires the generation of intermediate visual images to solve reasoning tasks.
2. The curated intermediate CoT images provided in the dataset are of high quality.
1. **Inconsistent Answer Formats:** The benchmark employs a wide variety of answer formats, including multiple-choice, free-form text, numeric values, lists, and even custom coordinate-based formats (e.g., the localizer task). This inconsistency makes it difficult to perform robust statistical analysis or establish a consistent "random guess" baseline for comparison across the diverse task types.
2. **Unclear Narrative and Analysis:** The paper's core objective is not clearly articulated; it re
1. The proposed three-level evaluation protocol is a major strength. By systematically comparing Direct Evaluation, Text-CoT, and Simulated Visual-CoT, it provides clear evidence for their central claims.
2. The authors have conducted an extensive evaluation across a large and representative set of MLLMs, including top-tier closed-source models and various open-weight alternatives.
1. **Limited Scale and Generality of the Benchmark**: The size of this benchmark (546 examples) is relatively modest, which could limit the statistical power of the conclusions. Besides, it remains an open question how well the findings would generalize to more common, real-world visual reasoning scenarios that are less puzzle-like.
2. **Lack of a Random Baseline**: The paper does not provide a random-chance baseline for its tasks. With reported accuracies in most tasks under 20%, it is difficu
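The two concerns above (statistical power at 546 examples, and the missing random-chance baseline) can be made concrete with a short calculation. This is a sketch under assumptions: the chance baseline applies only to multiple-choice items with a known number of options, the Wilson score interval is a standard choice swapped in here rather than anything the paper reports, and the count of correct answers is hypothetical.

```python
import math
from typing import Tuple


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical example: 100 correct out of the benchmark's 546 problems.
low, high = wilson_interval(100, 546)
print(f"accuracy={100 / 546:.3f}, 95% CI=({low:.3f}, {high:.3f})")
print(f"4-option chance baseline={chance_accuracy(4):.2f}")
```

If a model's interval overlaps the chance level for a task, its accuracy on that task cannot be distinguished from guessing. Per-task intervals are much wider than the whole-benchmark interval because each task has far fewer examples.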
1. This paper introduces a high-quality benchmark with human labeling and inspection. The benchmark spans 20 task types and includes 546 carefully designed examples. On this benchmark, current models perform poorly, indicating that it presents a challenging set of tasks for existing models.
2. By providing well-designed text CoTs and visual CoTs, the authors tested the upper-bound performance of current models on this task. Even with high-quality simulated Visual-CoT reasoning (manually annotated inter
1. As a benchmark-oriented study whose construction process does not involve automated generation by models, it may not align closely with ICLR’s primary interests. It would likely be more suitable for the data track.
2. In addition to providing text-CoT prompts and simulated visual CoTs for questions to test the model’s upper-bound performance under high-quality CoT conditions, there doesn’t seem to be a fundamental difference between this work and other multi-modality reasoning benchmarks, such as [1
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Generative Adversarial Networks and Image Synthesis
