Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information
Xu Chu, Xinrong Chen, Guanyu Wang, Zhijie Tan, Kui Huang, Wenyu Lv, Tong Mo, Weiping Li

TL;DR
Qwen-LookAgain is a vision-language reasoning model that mitigates hallucinations by guiding the model to re-attend to visual information during reasoning, combining a reflection process with reinforcement learning to improve accuracy and reduce errors.
Contribution
It introduces a novel reflection-guided approach with reinforcement learning and visual token re-attention mechanisms to reduce hallucinations in vision-language models.
Findings
Achieves state-of-the-art accuracy on visual QA datasets.
Significantly reduces hallucination metrics compared to baseline models.
Demonstrates the effectiveness of visual token re-attention during reasoning.
Abstract
Inference-time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, so visual information receives less attention, which may trigger hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attend to visual information during reasoning. We first propose a reinforcement learning method, Balanced Reflective Policy Optimization (BRPO), which guides the model to decide when to generate vision-text reflection on its own and balance the…
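The dilution effect described in the abstract can be pictured with a toy calculation. The sketch below is only an illustration, not the paper's method or measurement: it assumes a single attention query, a fixed number of visual tokens, and identical attention logits for every token unless overridden. The function name `visual_attention_share` and its parameters are hypothetical.

```python
import math


def visual_attention_share(n_visual, n_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of one query's softmax attention mass that lands on visual tokens.

    Toy assumption: every visual token shares one logit and every text token
    shares another, so only the token counts and the two logits matter.
    """
    logits = [visual_logit] * n_visual + [text_logit] * n_text
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    return sum(weights[:n_visual]) / sum(weights)


# As the reasoning trace grows, the visual share of attention shrinks.
for n_text in (64, 256, 1024, 4096):
    share = visual_attention_share(n_visual=256, n_text=n_text)
    print(f"{n_text:5d} text tokens -> visual share {share:.3f}")
```

With equal logits the share reduces to `n_visual / (n_visual + n_text)`, so it falls steadily as more reasoning tokens are generated. Real models have learned, position-dependent logits, so this shows only the direction of the effect.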
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need
