Look-Back: Implicit Visual Re-focusing in MLLM Reasoning

Shuo Yang; Yuwei Niu; Yuyang Liu; Yang Ye; Bin Lin; Li Yuan

arXiv:2507.03019·cs.CV·July 8, 2025

TL;DR

This paper shows that, with appropriate guidance, multimodal large language models (MLLMs) can spontaneously re-focus on visual inputs during the later stages of reasoning without explicit injection of visual information, and it introduces Look-Back, an implicit method for strengthening their visual reasoning.

Contribution

The paper uncovers the intrinsic ability of MLLMs to re-focus on visual inputs and proposes Look-Back, a novel implicit approach to improve multimodal reasoning without additional input modifications.

Findings

01. Look-Back improves reasoning performance on multiple benchmarks.

02. MLLMs can spontaneously re-focus on visual inputs during reasoning.

03. The approach enhances both reasoning and perception capabilities.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual information injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to "look back" at visual information in a…
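
The abstract refers to an analysis of attention patterns but gives no details here. As a rough illustration only, and not the authors' procedure, one way to quantify "re-focusing on visual inputs" is to track, for each generated token, the share of attention mass that lands on image-token positions. The sketch below assumes attention weights have already been extracted and averaged over heads and layers into a single matrix, with rows padded to a common context length. The function name `visual_attention_ratio` and the toy numbers are hypothetical.

```python
import numpy as np


def visual_attention_ratio(attn, image_mask):
    """Share of each generated token's attention mass that falls on image tokens.

    attn:       array of shape (num_generated_tokens, context_len) holding
                non-negative attention weights (assumed pre-aggregated).
    image_mask: boolean array of shape (context_len,), True at image-token positions.
    """
    attn = np.asarray(attn, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    total = attn.sum(axis=1)
    visual = attn[:, image_mask].sum(axis=1)
    # Rows with no attention mass at all are reported as 0 rather than NaN.
    return np.divide(visual, total, out=np.zeros_like(total), where=total > 0)


# Toy data (made up): two image tokens followed by two text tokens.
image_mask = np.array([True, True, False, False])
attn = np.array([
    [0.4, 0.4, 0.1, 0.1],  # early step: mostly on the image
    [0.1, 0.1, 0.4, 0.4],  # middle step: mostly on text
    [0.3, 0.3, 0.2, 0.2],  # later step: attention returns to the image
])
ratios = visual_attention_ratio(attn, image_mask)
print(ratios)
```

In this toy setup, a ratio that falls and then rises again over the generation steps would be the kind of pattern the abstract describes as looking back at the image. Real attention maps would need to be taken from the model under study.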
