SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

Weijia Zhang; Zijia Liu; Haoru Li; Haoqi Chen; Jiaxuan You

arXiv:2510.25092·cs.MA·October 30, 2025

4 Reviews

TL;DR

SeeingEye introduces an agent-based framework that enables text-only large language models to perform multimodal reasoning by translating visual inputs into structured representations, improving performance on visual question answering tasks.

Contribution

The paper presents a modular perception-reasoning framework that decouples visual perception from language reasoning, enabling existing text-only LLMs to handle multimodal tasks effectively.

Findings

01. Outperforms larger end-to-end vision-language models on VQA benchmarks.
02. Reduces inference cost compared to monolithic VLMs.
03. Enables scalable, plug-and-play multimodal reasoning with strong text-only LLMs.

Abstract

Recent advances in text-only large language models (LLMs), such as DeepSeek-R1, demonstrate remarkable reasoning ability. However, these models remain fragile or entirely incapable when extended to multimodal tasks. Existing approaches largely rely on single-form captions, which lack diversity and often fail to adapt across different types of Visual Question Answering (VQA) benchmarks. As a result, they provide no principled or efficient channel for transmitting fine-grained visual information. We introduce SeeingEye, a modular framework that unlocks multimodal reasoning in text-only LLMs through an agent-based small VLM translator. This translator acts as a perception agent: it can invoke specialized tools (e.g., OCR and crop) and iteratively distill multimodal inputs into structured intermediate representations (SIRs) tailored to the question. These SIRs are then passed to the…
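The perception-reasoning split described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SIR` schema, the `translate` and `reason` functions, the fixed tool `plan`, and the stub tools and stub model are all hypothetical names chosen for illustration. In particular, the abstract says the translator picks tools iteratively, whereas this sketch simply runs a fixed list of tools so that it stays deterministic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SIR:
    """Structured intermediate representation (hypothetical schema)."""
    question: str
    observations: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        # Serialize the SIR as plain text for a text-only model.
        lines = [f"Question: {self.question}"]
        lines += [f"- {obs}" for obs in self.observations]
        return "\n".join(lines)


# A tool maps an image handle to a textual observation (assumed interface).
Tool = Callable[[str], str]


def translate(image: str, question: str, tools: Dict[str, Tool],
              plan: List[str]) -> SIR:
    """Perception agent stub: run each tool in `plan` and collect its output.

    A real translator would let a small VLM decide which tool to call next;
    the fixed plan here is a simplification.
    """
    sir = SIR(question=question)
    for name in plan:
        sir.observations.append(f"{name}: {tools[name](image)}")
    return sir


def reason(sir: SIR, llm: Callable[[str], str]) -> str:
    """Reasoner stub: the text-only model sees the SIR text, never the image."""
    return llm(sir.to_text())


# Stand-in tools and model, for illustration only.
stub_tools: Dict[str, Tool] = {
    "ocr": lambda image: "text 'EXIT' found",
    "crop": lambda image: "cropped region shows a green sign",
}


def stub_llm(prompt: str) -> str:
    # Trivial rule in place of a real LLM call.
    return "EXIT" if "EXIT" in prompt else "unknown"


sir = translate("sign.png", "What does the sign say?", stub_tools, ["ocr", "crop"])
answer = reason(sir, stub_llm)
```

The point of the sketch is the interface: the reasoner only ever receives text, so any text-only model could in principle be swapped in behind `reason` without touching the perception side.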

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The idea of building multi-agent systems for multimodal reasoning is a compelling direction. It is a natural path toward a form of collective intelligence where specialized agents collaborate to solve complex problems. This paper effectively explores how to combine a perception-focused agent (the VLM Translator) with a cognition-focused agent (the text-only LLM Reasoner), which is a promising paradigm for building more capable and interpretable AI systems.
- The experimental results convincing

Weaknesses

- Motivation: The paper's premise is that text-only LLMs possess reasoning capabilities superior to those found in monolithic VLMs, thus motivating the need for decoupling. However, this motivation is becoming less convincing, as state-of-the-art VLMs (e.g., Qwen3-VL) are typically built upon the most capable LLMs available at the time of their creation. The argument implicitly assumes a gap in reasoning ability that may not exist, or at least is not sufficiently justified.
- Prompting LLMs to

Reviewer 02 · Rating 2 · Confidence 3

Strengths

* Separating sub-tasks makes sense but is also not novel. Previous published works in this direction are [Socratic Models, Zeng et al., ICLR'23] and [HAMMR, Castrejon et al., NeurIPS workshop 2024] and if you just go to LLMs using visual tools there are the seminal [VisProg CVPR’23] and [ViperGPT ICCV’23] papers which had quite a few follow-up papers.
* Results seem fine w.r.t. the used baselines on MMMU and MMMU-Pro

Weaknesses

* Giving LLMs visual capabilities with tool calls is not new. See references in point 1 of the strengths.
* As baselines, only vanilla Qwen and GPT-4o are used. I do not see any attempts to use contemporary techniques with these models such as Chain-of-Thought versions or LLMs with tool calls. Just the vanilla versions.
* The other baseline is OpenManus. There are currently many agentic frameworks around and it is hard to estimate how good a baseline this is by just reading this paper.
* In thei

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. This paper aims to enhance visual reasoning capabilities, which is an important and widely recognized problem in the research community.
2. The paper is overall easy to follow.
3. The references are relatively comprehensive.

Weaknesses

Although I believe this paper is technically solid overall, I have several concerns:
1. On novelty. The paper focuses on improving visual reasoning by **constructing prompting scaffolds** to enhance reasoning ability. However, similar ideas have been extensively explored in both the LLM and VLM literature, including but not limited to Prism [1]. While earlier designs may have been less sophisticated, the underlying idea is largely similar. A common issue with this line of work is that as VLMs t

Reviewer 04 · Rating 2 · Confidence 3

Strengths

• The organization of the paper is good, contains examples, and explanations.
• The structure of the framework is clearly explained.

Weaknesses

• The authors aim to unlock multimodal reasoning in Text-only LLMs; however, the LLM-based reasoner is just extracting textual information from the structured text. The main process is done with a VLM-based translator, which determines the visual part of the VQA question and converts it to a textual result. So, the success of the process depends on the translator-based VLM, which makes it not purely LLM reasoning.
• The authors state that “the results highlight a scalable pathway to advanced m
