Perceptual Flow Network for Visually Grounded Reasoning

Yangfu Li; Yuning Gong; Hongjian Zhan; Teng Li; Yuanhuiyi Lyu; Tianyi Chen; Qi Liu; Ziyuan Huang; Zhihang Zhong; Dandan Zheng; Yue Lu

arXiv:2605.02730·cs.CV·May 5, 2026

TL;DR

PFlowNet is a perceptual flow network that improves visual reasoning in LVLMs by decoupling perception from reasoning and training with variational reinforcement learning. It reports 90.6% on V* Bench and 67.0% on MME-RealWorld-lite.

Contribution

PFlowNet introduces a self-conditioned generation process and a variational reinforcement learning framework that improve visual reasoning without rigid alignment to geometric priors from visual experts.

Findings

01. Sets a new state of the art on V* Bench with 90.6%.

02. Achieves 67.0% on MME-RealWorld-lite.

03. Provides a performance guarantee for visual reasoning.

Abstract

Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating…
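
One way to read the decoupling described in the abstract is as a factorization of the generation process into a perception step and a reasoning step. The notation below is an illustrative assumption, not the paper's own: x is the image, q the question, z a hypothetical perceptual trajectory (for example, a sequence of attended regions), y the answer, and r_k and \lambda_k are placeholder reward terms and weights standing in for the "multi-dimensional rewards". The exact objective, including the vicinal geometric shaping term, is not recoverable from the truncated abstract.

```latex
% Standard MLE (assumed form): only the answer tokens are supervised,
% so nothing constrains an intermediate visual trajectory.
\mathcal{L}_{\mathrm{MLE}}(\theta) = -\log p_\theta(y \mid x, q)

% Decoupled, self-conditioned generation (assumed form):
% perceive first, then reason conditioned on the model's own percept z.
p_\theta(y, z \mid x, q) = p_\theta(z \mid x, q)\, p_\theta(y \mid x, q, z)

% Generic reinforcement learning objective over both steps (assumed form):
% a weighted sum of K reward terms evaluated on the sampled pair (z, y).
J(\theta) = \mathbb{E}_{z \sim p_\theta(\cdot \mid x, q),\; y \sim p_\theta(\cdot \mid x, q, z)}
            \left[ \sum_{k=1}^{K} \lambda_k\, r_k(z, y) \right]
```

Under this reading, the answer is conditioned on a trajectory the model generated itself rather than on one forced to match an expert's geometric annotation, which is consistent with the abstract's claim of avoiding rigid alignment with expert priors.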
