Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
Jing Bi, Guangyu Sun, Ali Vosoughi, Chen Chen, Chenliang Xu

TL;DR
This paper systematically diagnoses how current vision-language models fail at visual reasoning and proposes an agent-based architecture that pairs LLM reasoning with lightweight visual modules, improving a 7B baseline by +10.3 on MMMU and +6.0 on MathVista.
Contribution
It introduces an agent-based architecture that combines LLM reasoning with lightweight visual modules, and a three-stage evaluation framework for diagnosing visual reasoning models.
Findings
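Performance gains over a 7B baseline (+10.3 on MMMU, +6.0 on MathVista), matching or surpassing much larger models
Key failure modes in current models, including visual hallucinations and over-reliance on textual priors
Iterative refinement of reasoning chains through fine-grained visual analysis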
Abstract
Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our…
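The abstract does not specify how the agent and its visual modules interact, so the following is only a minimal sketch of the general pattern it describes: a planner (standing in for the LLM) repeatedly picks a visual tool, reads the observation, and refines its reasoning trace until it answers. Every name here (ReasoningAgent, stub_planner, stub_ocr, the "ocr" tool key, max_steps) is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a visual module takes a text query and returns a text observation.
VisualTool = Callable[[str], str]


@dataclass
class ReasoningAgent:
    # planner(question, trace) -> (action, argument); stands in for the LLM (assumption).
    planner: Callable[[str, List[str]], Tuple[str, str]]
    tools: Dict[str, VisualTool]
    max_steps: int = 4
    trace: List[str] = field(default_factory=list)

    def run(self, question: str) -> str:
        for _ in range(self.max_steps):
            action, argument = self.planner(question, self.trace)
            if action == "answer":
                return argument
            tool = self.tools.get(action)
            observation = tool(argument) if tool else f"unknown tool: {action}"
            # Each observation is appended so the next planning step can refine the chain.
            self.trace.append(f"{action}({argument}) -> {observation}")
        return "no answer within step budget"


def stub_ocr(query: str) -> str:
    # Placeholder for a lightweight visual module; returns a fixed observation.
    return "x = 3"


def stub_planner(question: str, trace: List[str]) -> Tuple[str, str]:
    # Placeholder policy: call the OCR tool once, then answer with what it observed.
    if not trace:
        return ("ocr", "axis labels")
    return ("answer", trace[-1].split("-> ")[1])


agent = ReasoningAgent(planner=stub_planner, tools={"ocr": stub_ocr})
result = agent.run("What value is labelled on the x axis?")
```

In this sketch the tools are plain callables keyed by name, so adding another specialized module only means adding an entry to the tools dictionary; whether the paper's system is organized this way is not stated in the abstract.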
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
