TL;DR
SeeingEye introduces an agent-based framework that enables text-only large language models to perform multimodal reasoning by translating visual inputs into structured representations, improving performance on visual question answering tasks.
Contribution
The paper presents a modular perception-reasoning framework that decouples visual perception from language reasoning, enabling existing text-only LLMs to handle multimodal tasks effectively.
Findings
Outperforms larger end-to-end vision-language models on VQA benchmarks.
Reduces inference cost compared to monolithic VLMs.
Enables scalable, plug-and-play multimodal reasoning with strong text-only LLMs.
Abstract
Recent advances in text-only large language models (LLMs), such as DeepSeek-R1, demonstrate remarkable reasoning ability. However, these models remain fragile or entirely incapable when extended to multi-modal tasks. Existing approaches largely rely on single-form captions, which lack diversity and often fail to adapt across different types of Visual Question Answering (VQA) benchmarks. As a result, they provide no principled or efficient channel for transmitting fine-grained visual information. We introduce Seeing Eye, a modular framework that unlocks multimodal reasoning in text-only LLMs through an agent-based small VLM translator. This translator acts as a perception agent: it can invoke specialized tools (e.g., OCR and crop) and iteratively distill multimodal inputs into structured intermediate representations (SIRs) tailored to the question. These SIRs are then passed to the…
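The perception-reasoning split described in the abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: every name here (`SIR`, `translate`, `reason`, `fake_ocr`, `fake_crop`, `echo_llm`) is hypothetical, the tools are stubs that return strings, and the tool plan is fixed by the caller, whereas the paper describes a small VLM choosing tools iteratively.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SIR:
    """Structured intermediate representation (field names are assumptions)."""
    question: str
    observations: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Serialize the SIR as plain text so a text-only model can read it.
        lines = [f"Question: {self.question}", "Observations:"]
        lines += [f"- {obs}" for obs in self.observations]
        return "\n".join(lines)


# Stand-in tools; a real system would call an OCR engine and an image cropper.
def fake_ocr(image: str) -> str:
    return f"text read from {image}"


def fake_crop(image: str) -> str:
    return f"close-up of {image}"


TOOLS: Dict[str, Callable[[str], str]] = {"ocr": fake_ocr, "crop": fake_crop}


def translate(image: str, question: str, plan: List[str]) -> SIR:
    """Perception agent: run each planned tool and record its output in the SIR.

    Assumption: `plan` is supplied by the caller instead of being chosen by a VLM.
    """
    sir = SIR(question=question)
    for tool_name in plan:
        sir.observations.append(f"{tool_name}: {TOOLS[tool_name](image)}")
    return sir


def reason(sir: SIR, llm: Callable[[str], str]) -> str:
    """Reasoning agent: the text-only LLM sees only the SIR, never the image."""
    return llm(sir.to_prompt())


def echo_llm(prompt: str) -> str:
    # Placeholder for a text-only LLM call; it only counts the observation lines.
    n_obs = sum(line.startswith("- ") for line in prompt.splitlines())
    return f"answer based on {n_obs} observations"


sir = translate("chart.png", "What is the peak value?", plan=["ocr", "crop"])
answer = reason(sir, echo_llm)
print(sir.to_prompt())
print(answer)
```

The point of the sketch is the interface: the reasoner receives only text, so any text-only LLM could be swapped in behind the `llm` callable without touching the perception side.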
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The idea of building multi-agent systems for multimodal reasoning is a compelling direction. It is a natural path toward a form of collective intelligence where specialized agents collaborate to solve complex problems. This paper effectively explores how to combine a perception-focused agent (the VLM Translator) with a cognition-focused agent (the text-only LLM Reasoner), which is a promising paradigm for building more capable and interpretable AI systems.
- The experimental results convincing
- Motivation: The paper's premise is that text-only LLMs possess reasoning capabilities superior to those found in monolithic VLMs, thus motivating the need for decoupling. However, this motivation is becoming less convincing, as state-of-the-art VLMs (e.g., Qwen3-VL) are typically built upon the most capable LLMs available at the time of their creation. The argument implicitly assumes a gap in reasoning ability that may not exist, or at least is not sufficiently justified.
- Prompting LLMs to
* Separating sub-tasks makes sense but is also not novel. Previous published works in this direction are [Socratic Models, Zeng et al., ICLR'23] and [HAMMR, Castrejon et al., NeurIPS workshop 2024] and if you just go to LLMs using visual tools there are the seminal [VisProg CVPR’23] and [ViperGPT ICCV’23] papers which had quite a few follow-up papers.
* Results seem fine w.r.t. the used baselines on MMMU and MMMU-Pro
* Giving LLMs visual capabilities with tool calls is not new. See references in point 1 of the strengths.
* As baselines, only vanilla Qwen and GPT-4o are used. I do not see any attempts to use contemporary techniques with these models such as Chain-of-Thought versions or LLMs with tool calls. Just the vanilla versions.
* The other baseline is OpenManus. There are currently many agentic frameworks around and it is hard to estimate how good a baseline this is by just reading this paper.
* In thei
1. This paper aims to enhance visual reasoning capabilities, which is an important and widely recognized problem in the research community.
2. The paper is overall easy to follow.
3. The references are relatively comprehensive.
Although I believe this paper is technically solid overall, I have several concerns:
1. On novelty. The paper focuses on improving visual reasoning by **constructing prompting scaffolds** to enhance reasoning ability. However, similar ideas have been extensively explored in both the LLM and VLM literature, including but not limited to Prism [1]. While earlier designs may have been less sophisticated, the underlying idea is largely similar. A common issue with this line of work is that as VLMs t
• The paper is well organized and contains examples and explanations.
• The structure of the framework is clearly explained.
• The authors aim to unlock multimodal reasoning in text-only LLMs; however, the LLM-based reasoner is just extracting textual information from the structured text. The main process is done by the VLM-based translator, which determines the visual part of the VQA question and converts it to a textual result. So, the success of the process depends on the VLM-based translator, which makes it not purely LLM reasoning.
• The authors state that “the results highlight a scalable pathway to advanced m
