See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

Zhiheng Wu; Tong Wang; Shuning Wang; Naiming Liu; Yumeng Zhang

arXiv:2604.24339·cs.CV·April 28, 2026

TL;DR

This paper introduces ForeSight, a multimodal reasoning framework for VLMs that incorporates low-level visual cues and visual reflection, improving reasoning accuracy and surpassing some state-of-the-art models.

Contribution

The paper presents a novel framework combining low-level visual tools and visual feedback mechanisms, trained with reinforcement learning, to enhance VLM reasoning capabilities.

Findings

01

ForeSight-7B outperforms other models of the same size.

02

The framework surpasses some current SOTA closed-source models.

03

The authors construct the CG-SalBench dataset for evaluation.

Abstract

Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL,…
