Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information

Xu Chu; Xinrong Chen; Guanyu Wang; Zhijie Tan; Kui Huang; Wenyu Lv; Tong Mo; Weiping Li

arXiv:2505.23558 · cs.CV · June 2, 2025

TL;DR

Qwen-LookAgain (Qwen-LA) is a vision-language reasoning model that mitigates hallucinations by re-attending to visual information during reasoning. It combines a vision-text reflection process with reinforcement learning so that the model decides on its own when to reflect.

Contribution

The paper introduces a vision-text reflection approach, trained with reinforcement learning, that re-attends to visual tokens during reasoning to reduce hallucinations in vision-language models.

Findings

01 Achieves state-of-the-art accuracy on visual QA datasets.

02 Significantly reduces hallucination metrics compared to baseline models.

03 Demonstrates the effectiveness of visual token re-attention during reasoning.

Abstract

Inference-time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, so visual information receives less attention, which may trigger hallucinations. Although text-only reflection processes show promise in language models, we demonstrate that they are insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attend to visual information during reasoning. We first propose a reinforcement learning method, Balanced Reflective Policy Optimization (BRPO), which guides the model to decide when to generate vision-text reflection on its own and balance the…
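The dilution effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or measurement: it assumes a single softmax attention step in which every token receives the same logit by default, and the function name `visual_attention_share` and its parameters are hypothetical. Under that assumption, the share of attention mass on visual tokens shrinks as the number of reasoning (text) tokens grows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_attention_share(n_visual, n_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of attention mass that lands on the visual tokens.

    Hypothetical toy model: one query attends over n_visual visual tokens
    followed by n_text text tokens, each group sharing a single logit.
    """
    scores = [visual_logit] * n_visual + [text_logit] * n_text
    weights = softmax(scores)
    return sum(weights[:n_visual])


if __name__ == "__main__":
    # Keep the image token count fixed and lengthen the reasoning chain.
    for n_text in (64, 512, 4096):
        share = visual_attention_share(n_visual=256, n_text=n_text)
        print(f"text tokens={n_text:5d}  visual attention share={share:.3f}")
```

With equal logits the share reduces to n_visual / (n_visual + n_text), so it falls steadily as the reasoning chain lengthens. Real attention patterns are learned and vary by layer and head, so this shows only the direction of the effect, not its size.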

Code & Models

Repositories

liar406/look_again (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need