Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
Yufei Zhan, Hongyin Zhao, Yousong Zhu, Shurong Zheng, Fan Yang, Ming Tang, Jinqiao Wang

TL;DR
This paper introduces Griffon-R, a unified visual reasoning model that mimics human-like understanding and reasoning processes, significantly improving performance on complex visual reasoning tasks and benchmarks.
Contribution
It proposes a novel, single-pass, human-like reasoning mechanism for large multimodal models, bridging visual understanding and question answering without external tools.
Findings
Achieves state-of-the-art results on VSR and CLEVR benchmarks.
Enhances multimodal capabilities on MMBench and ScienceQA.
Supports end-to-end automatic understanding and reasoning.
Abstract
Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g., grounding and visual understanding). Unlike the previous shortcut learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without the need for multiple inferences or external tools. This design bridges the gap between foundational visual capabilities…
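The understanding-thinking-answering process described in the abstract produces all three stages in one generated output. The sketch below illustrates how such an output could be consumed downstream. It is only an illustration under stated assumptions: the `<understand>`, `<think>`, and `<answer>` markers, the `StagedResponse` type, the `parse_staged_response` helper, and the example string are all hypothetical and are not taken from the paper, whose actual output format is not given in this summary.

```python
import re
from dataclasses import dataclass

# Hypothetical stage markers, chosen only for this illustration;
# the paper's actual output format is not given in this summary.
STAGES = ("understand", "think", "answer")


@dataclass
class StagedResponse:
    understand: str
    think: str
    answer: str


def parse_staged_response(text: str) -> StagedResponse:
    """Split one generated string into its three stages."""
    parts = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{stage}> section")
        parts[stage] = match.group(1).strip()
    return StagedResponse(**parts)


# A made-up response standing in for the output of a single generation call.
example = (
    "<understand>cat at [120, 80, 310, 400]; sofa at [0, 200, 640, 480]</understand>"
    "<think>The cat box overlaps the upper part of the sofa box, "
    "so the cat is on the sofa.</think>"
    "<answer>Yes</answer>"
)

result = parse_staged_response(example)
print(result.answer)
```

Because all three stages arrive in one string, a caller needs no second model call or external tool to obtain the final answer; the intermediate stages remain available for inspection.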
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The results in Table 2 were strong.
- The "Understand-Think-Answer" approach is a relatively common prompt engineering approach and is neither novel nor a significant research contribution.
- The authors add a new dataset, test their fine-tuned model on data similar to their new dataset, and report strong results. This likely results in overfitting to their fine-tuned dataset and forgetting of old information, which should be tested for.
- The benchmark comparisons made are unfair as they compare a model with more data that has been
The paper provides enough details to help the readers understand the full story. The experimental results cover multiple domains.
1. No novel methodology or insights proposed: the unified framework features no significant difference compared to widely-used CoTs for VLMs. The motivation of unifying visual reasoning format for different tasks is unclear, as special tasks (math, coding, ...) need special CoT formats, and advanced LLMs can discover new formats via RL.
2. The compared models and benchmarks are severely outdated for a work in 2025. For example, they used Qwen-2-vl for data annotation, while Qwen2.5-vl is released
1. The work pushes forward end-to-end compositional reasoning, a core obstacle in vision-language integration. It establishes a strong empirical and conceptual foundation for future intrinsically capable multimodal reasoning models, potentially influencing both academic and applied LMM research.
2. The paper provides a high-quality dataset (334K samples) created through a semi-automatic expert-supervised process, a valuable contribution for the community.
1. The paper lacks a formal theoretical analysis and ablation experiments quantifying the specific contribution of each UTA phase.
2. Generalization to out-of-domain or noisy visual data remains unclear.
3. Many references are incorrectly formatted or missing spaces between citations and text, which affects readability and professionalism.
Well-motivated approach: The "understand-think-answer" paradigm is intuitive and addresses real limitations of current LMMs in compositional reasoning. The motivation is clearly articulated with concrete examples (Figure 1).
Efficiency advantage: Completing all reasoning steps in a single forward pass without external tools is more efficient than toolkit-based methods. Table 4 shows 13x speedup over SEAL while maintaining competitive accuracy.
Strong empirical results: Griffon-R achieves state
1. Limited Technical Novelty
The core contribution appears to be more about structured data annotation than a novel reasoning mechanism:
- The "understanding" step simply prompts the model to use existing capabilities (grounding, captioning, OCR) it already possesses
- The "thinking" step is self-prompting based on context, which is well-explored in CoT literature
- The main innovation is in training data format rather than architecture or fundamental mechanisms
2. Insufficient Detail on Data Ann
Taxonomy
Topics: Multimodal Machine Learning Applications · Educational Games and Gamification
