Look-Back: Implicit Visual Re-focusing in MLLM Reasoning
Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, Li Yuan

TL;DR
This paper shows that, with appropriate guidance, multimodal large language models (MLLMs) can spontaneously re-focus on visual inputs during the later stages of reasoning without explicit visual information injection, and introduces Look-Back, an implicit method to enhance their visual reasoning capabilities.
Contribution
The paper uncovers the intrinsic ability of MLLMs to re-focus on visual inputs and proposes Look-Back, a novel implicit approach to improve multimodal reasoning without additional input modifications.
Findings
Look-Back improves reasoning performance on multiple benchmarks.
With appropriate guidance, MLLMs can spontaneously re-focus on visual inputs during the later stages of reasoning.
The approach enhances both reasoning and perception capabilities.
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual information injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to "look back" at visual information in a…
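The abstract refers to an analysis of MLLM attention patterns. As a rough illustration of what such a measurement could look like, the sketch below computes, for each generated token, the share of attention mass that falls on image-token positions. This is not the authors' analysis code: the function name `visual_attention_ratio`, the input layout, and the assumption that attention weights have already been extracted and averaged over heads and layers are all hypothetical.

```python
from typing import List, Sequence


def visual_attention_ratio(
    attn_rows: Sequence[Sequence[float]],
    is_image_token: Sequence[bool],
) -> List[float]:
    """Return, per generated token, the fraction of attention on image positions.

    attn_rows[t][j] is the (assumed pre-averaged) attention weight that generated
    token t places on context position j; is_image_token[j] marks image tokens.
    Both layouts are assumptions made for this sketch.
    """
    ratios = []
    for row in attn_rows:
        if len(row) != len(is_image_token):
            raise ValueError("each attention row must match the mask length")
        total = sum(row)
        # Sum only the weights that land on image-token positions.
        visual = sum(w for w, is_img in zip(row, is_image_token) if is_img)
        # Guard against an all-zero row rather than dividing by zero.
        ratios.append(visual / total if total > 0 else 0.0)
    return ratios


# Toy example: two generated tokens attending over three context positions,
# where the first two positions are image tokens.
toy_attention = [[0.5, 0.25, 0.25], [0.125, 0.125, 0.75]]
toy_mask = [True, True, False]
toy_ratios = visual_attention_ratio(toy_attention, toy_mask)
```

Plotting such ratios against the generation step would be one simple way to see whether attention to the image decays in later reasoning stages and whether it recovers when the model re-focuses on the visual input.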
