Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

Marco Morini; Sara Sarto; Marcella Cornia; Lorenzo Baraldi

arXiv:2604.01280·cs.CV·April 3, 2026

TL;DR

Look Twice (LoT) is a training-free inference framework that enhances multimodal large language models by highlighting relevant visual and textual evidence during answer generation, improving performance on knowledge-based visual question answering tasks.

Contribution

Introduces an inference-time evidence highlighting method for pretrained MLLMs that improves how they use multimodal evidence without any additional training.

Findings

01. Consistent improvements on multiple knowledge-based VQA benchmarks.

02. Evidence highlighting enhances performance even without textual context.

03. No additional training or architectural changes needed.

Abstract

Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues…
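
To make the idea of attention-based relevance estimation concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not specify how attention is aggregated or how evidence is selected, so the averaging over heads and query tokens, the top-k selection, and the function names `relevance_from_attention` and `top_k_evidence` are all assumptions made for illustration. The attention tensor is a toy array standing in for weights that would come from a pretrained MLLM.

```python
import numpy as np

def relevance_from_attention(attn, query_idx, evidence_idx):
    """Score evidence positions by how much the query tokens attend to them.

    attn: array of shape (heads, seq, seq); each row is an attention distribution.
    Assumption (not from the paper): mean over heads and query tokens.
    """
    # Keep only rows for query tokens and columns for evidence tokens.
    block = attn[:, query_idx, :][:, :, evidence_idx]  # (heads, n_query, n_evidence)
    scores = block.mean(axis=(0, 1))                   # one score per evidence position
    return scores / scores.sum()                       # normalise so scores sum to 1

def top_k_evidence(scores, evidence_idx, k):
    """Return the k evidence positions with the highest scores (hypothetical selection rule)."""
    order = np.argsort(scores)[::-1][:k]
    return sorted(int(evidence_idx[i]) for i in order)

# Toy sequence: positions 0-3 are evidence (image patches or retrieved text),
# positions 4-5 are the question tokens.
evidence_idx = [0, 1, 2, 3]
query_idx = [4, 5]

attn = np.full((2, 6, 6), 0.1)
attn[:, 4:, 2] = 0.5                         # question tokens focus on evidence position 2
attn /= attn.sum(axis=-1, keepdims=True)     # make every row a valid distribution

scores = relevance_from_attention(attn, query_idx, evidence_idx)
highlighted = top_k_evidence(scores, evidence_idx, k=1)
print(scores, highlighted)
```

In a pipeline of this kind, the positions in `highlighted` would then be emphasised in a second generation pass, which corresponds to generating the answer conditioned on the highlighted evidence; how that emphasis is applied is left open here because the abstract does not describe it.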
