InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang

TL;DR
This paper introduces InSight-o3, a multi-agent framework with a visual search component that enhances multimodal reasoning capabilities of foundation models, demonstrated through a new challenging benchmark called O3-Bench.
Contribution
It proposes a novel multi-agent approach with a visual search agent and a specialized multimodal LLM, advancing open multimodal systems' reasoning abilities.
Findings
O3-Bench is highly challenging: even a frontier system like OpenAI o3 reaches only 40.8% accuracy.
The visual search agent significantly improves model performance across benchmarks.
The framework marks progress towards more capable open multimodal AI systems.
Abstract
The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we…
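The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SearchRequest`, `Answer`, `run_loop`, `toy_reasoner`, and `toy_searcher` are hypothetical, the two agents are stubbed with plain functions rather than models, and the box format is an assumption. It only shows the control flow of a reasoner that repeatedly asks a searcher to locate regions before answering.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union

# Assumed box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class SearchRequest:
    """Hypothetical message: the reasoner asks for a region by description."""
    description: str


@dataclass
class Answer:
    """Hypothetical message: the reasoner commits to a final answer."""
    text: str


Step = Union[SearchRequest, Answer]


def run_loop(
    question: str,
    reasoner: Callable[[str, List[Box]], Step],
    searcher: Callable[[str], Box],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Box]]:
    """Alternate between reasoning and search until an answer or the step limit."""
    found: List[Box] = []
    for _ in range(max_steps):
        step = reasoner(question, found)
        if isinstance(step, Answer):
            return step.text, found
        # Delegate localization to the search agent and keep the result.
        found.append(searcher(step.description))
    return None, found  # step budget exhausted without an answer


# Toy stand-ins for the two agents (assumptions, for illustration only).
def toy_searcher(description: str) -> Box:
    regions = {"legend": (0, 0, 100, 50), "bar chart": (0, 60, 400, 300)}
    return regions[description]


def toy_reasoner(question: str, found: List[Box]) -> Step:
    wanted = ["legend", "bar chart"]
    if len(found) < len(wanted):
        return SearchRequest(wanted[len(found)])
    return Answer(f"answered using {len(found)} regions")


answer, boxes = run_loop("Which series is largest?", toy_reasoner, toy_searcher)
```

Keeping the searcher behind a narrow interface (a description in, a box out) is what would allow it to be trained or swapped independently of the reasoner, which is the modularity the reviews below highlight.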
Peer Reviews
Decision: ICLR 2026 Poster
* This paper decouples visual reasoning from visual search, leading to a modular system which is more understandable and whose components can be trained independently.
* Results are state-of-the-art.
* Good improvement using Gemini as a visual reasoner suggests that even though their visual search model was optimized on GPT-5-mini, it generalizes to multiple visual reasoners.
* Ablation shows good improvement of RL fine-tuning and training efficiency benefit of using their static RL setup in con
* The new dataset is rather small and limited in domain.
* Minor: Table 1 is a bit confusing: the bottom part seems the most important, while the top part is hardly discussed and only used for context; I would suggest showing the bottom part either on top or, perhaps better, separately as Figure 1.
* Minor: it is unclear which tools are used by Qwen2.5-VL. Anything other than 'crop'?
* Minor: specialization of agents and sub-agents in agentic frameworks has been shown to work in prior art. Exam
1. The paper clearly identifies a specific weakness in current multimodal agents: their inability to perform complex reasoning that requires integrating fine-grained visual details. It presents a challenging benchmark (O3-Bench) designed explicitly to measure this underdeveloped capability.
2. The proposed InSight-o3 framework presents a sophisticated multi-agent architecture that decomposes the complex problem into specialized sub-tasks (reasoning and search).
The claim that the framework's search steps decrease with increasing resolution is not sufficiently supported, as the reported variations across resolutions are minimal. This suggests the search pattern may be overly reliant on the characteristics of the training data, raising concerns about its scalability and effectiveness in real-world, multi-step search-and-reasoning tasks involving high-resolution images. The framework's performance on powerful yet tool-agnostic models like GPT-4o and Gemi
(1) It is a promising direction to improve multi-modal reasoning models by incorporating external tools (i.e., the visual search module in this case), which could provide both a performance boost and enhanced transparency.
(2) The use of collages for visual search training reduces the reliance on large-scale naturalistic data.
(3) The paper develops a new evaluation benchmark for multi-modal reasoning, which can benefit the development of subsequent models.
(4) The proposed method shows generaliz
(1) It is not a new idea to combine VLMs with external tools (e.g., some compositional reasoning models [ref1, ref2] already explore the tool usage with reinforcement learning). The paper experiments with a single tool (i.e., visual search framed as a visual grounding task), while solving real-life problems could require diverse abilities. It is unclear whether visual search (especially when trained independently) can help generalize reasoning across different scenarios.
(2) One advantage of h
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
