See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang

TL;DR
This paper introduces ForeSight, a multimodal reasoning framework for VLMs that incorporates low-level visual cues and visual reflection, improving reasoning accuracy and surpassing some state-of-the-art models.
Contribution
The paper presents a novel framework combining low-level visual tools and visual feedback mechanisms, trained with reinforcement learning, to enhance VLM reasoning capabilities.
Findings
ForeSight-7B outperforms other models of the same size.
The framework surpasses some current SOTA closed-source models.
The authors construct the CG-SalBench dataset for evaluation.
Abstract
Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL,…
