ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom
Jingqi Zhou, Sheng Wang, Jingwei Dong, Kai Liu, Lei Li, Jiahui Gao, Jiyue Jiang, Lingpeng Kong, Chuan Wu

TL;DR
ProReason introduces a novel multi-modal visual reasoning framework that separates visual perception from textual reasoning, enabling iterative proactive perception and improved performance on visual reasoning benchmarks.
Contribution
It proposes a decoupled vision-reasoning framework with multi-run proactive perception, enhancing multi-modal reasoning and integrating large language models effectively.
Findings
Outperforms existing multi-step reasoning frameworks with an average gain of 13.2%.
Enables high-quality visual reasoning data generation for downstream tasks.
Achieves superior performance on various benchmarks for open-source and closed-source models.
Abstract
Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., limited multi-modal reasoning capacities, and insufficient and irrelevant visual descriptions). We then decompose the visual reasoning process into two stages: proactive visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features decoupled vision-reasoning capabilities and multi-run proactive perception. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient…
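The iterate-until-sufficient loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `proactive_reasoning_loop`, the callables `dispatch`, `is_sufficient`, and `answer`, the `max_rounds` cap, and all toy stubs are assumptions made for illustration. Only the roles of a dispatcher, a Vision Expert, a Reasoning Expert, and a shared memory are taken from the text and reviews on this page. The point it tries to show is the decoupling: only the vision callable receives the image, while the reasoning side works on text alone.

```python
from typing import Callable, List, Tuple

Memory = List[str]  # textual notes shared between the two experts (assumed format)


def proactive_reasoning_loop(
    question: str,
    image: object,
    dispatch: Callable[[str, Memory], str],            # hypothetical: returns "vision" or "reasoning"
    vision_expert: Callable[[object, str, Memory], str],
    reasoning_expert: Callable[[str, Memory], str],     # text only: never sees the image
    is_sufficient: Callable[[str, Memory], bool],       # hypothetical stopping check
    answer: Callable[[str, Memory], str],               # hypothetical final answer step
    max_rounds: int = 5,                                # assumed safety cap
) -> Tuple[str, Memory]:
    memory: Memory = []
    for _ in range(max_rounds):
        # Stop as soon as the collected notes are judged enough to answer.
        if is_sufficient(question, memory):
            break
        if dispatch(question, memory) == "vision":
            # Proactive perception: look at the image with the question in mind.
            memory.append("vision: " + vision_expert(image, question, memory))
        else:
            # Textual reasoning over what has been collected so far.
            memory.append("reasoning: " + reasoning_expert(question, memory))
    return answer(question, memory), memory


# Toy stand-ins so the sketch runs end to end; a real system would call models here.
toy_image = {"apples": 2, "pears": 3}

final, memory = proactive_reasoning_loop(
    question="How many pieces of fruit are shown?",
    image=toy_image,
    dispatch=lambda q, mem: "vision" if not mem else "reasoning",
    vision_expert=lambda img, q, mem: f"{img['apples']} apples and {img['pears']} pears",
    reasoning_expert=lambda q, mem: "the total is the sum of the counted items",
    is_sufficient=lambda q, mem: len(mem) >= 2,
    answer=lambda q, mem: " | ".join(mem),
)
print(final)
```

Because the experts are passed in as plain callables, a stronger text-only model could be swapped in for `reasoning_expert` without touching the perception side, which is the kind of integration the summary above attributes to the framework.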
Peer Reviews
Decision: Submitted to ICLR 2025
+ The paper provides a detailed analysis of the limitations in existing models, highlighting how current LVLMs tend to rely more on language information than on visual cues. This analysis points out issues such as insufficient and irrelevant visual descriptions and limited multi-modal capabilities.
+ The proposed ProReason framework can iteratively generate proactive perception and effectively decouple vision and language capabilities.
+ Extensive results on four benchmarks demonstrate ProReason
- The authors claim to 'decouple' multi-modal reasoning into visual perception and textual reasoning. However, they do not provide evaluations on key aspects such as the frequency with which the dispatcher selects the Vision Expert versus the Reasoning Expert, the content of the generated memory from each expert, and the relevance between the memory generated by the Vision Expert and the Reasoning Expert and standard answers.
- The authors compute relevance scores and evaluate caption effectiven
+ The proposed method presents an effective way to extract the necessary and sufficient visual details from images for the subsequent multi-modal reasoning step.
+ Extensive and comprehensive experiments demonstrate the superiority of the proposed method.
1) Sub-optimal Design and Assumption. In Sect. 2.2, the authors argue that a detailed caption of the given image cannot provide sufficient and relevant information for Visual Reasoning (VR). However, some works (e.g., [s1]) concentrate on optimizing the caption. It might be more reasonable to include an optimized caption for the proposed Action step rather than use Q-I as input.
2) Insufficient Experimental Evaluation.
+ Why isn't GPT-4V included for comparison? It is somewhat difficult to
I see two main strengths in this paper. The paper presents a compelling approach to separating visual perception from textual reasoning, addressing a fundamental limitation of current LVLMs. This decomposition allows for more effective handling of each capability and enables the integration of specialized models for different aspects of the task. The framework's ability to seamlessly integrate existing LLMs for improved reasoning capabilities is particularly valuable, as it a
There are two main weaknesses in this paper.
1. The paper doesn't thoroughly discuss how the framework handles cases where the Vision Expert and Reasoning Expert disagree or provide conflicting information. Suggestion: Add a section analyzing failure cases and how the framework handles conflicting information between agents.
2. The evaluation focuses on specific benchmark tasks, but doesn't extensively explore how the framework scales with increasing complexity of visual scenes or reasoning r
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Multi-Agent Systems and Negotiation
