Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
Yeji Park, Minyoung Lee, Sanghyuk Chun, Junsuk Choe

TL;DR
This paper introduces FOCUS, a training-free decoding strategy that reduces cross-image information leakage in LVLMs, significantly improving multi-image task performance without requiring retraining or architecture changes.
Contribution
The paper proposes FOCUS, a novel, training-free, architecture-agnostic decoding method to mitigate cross-image information leakage in LVLMs during inference.
Findings
FOCUS improves accuracy across four multi-image benchmarks.
FOCUS is compatible with diverse LVLM architectures.
The method enhances multi-image reasoning without additional training.
Abstract
Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS…
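The decoding loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `logits_fn` is a hypothetical stand-in for a single LVLM forward pass that returns next-token logits, images are represented as flat lists of floats, the aggregation is assumed to be a simple mean, and the contrastive step is assumed to take the common form `(1 + alpha) * aggregated - alpha * reference`. The abstract does not specify the aggregation rule, the contrastive formula, or the `alpha` parameter.

```python
import random
from typing import Callable, List, Sequence

Image = List[float]   # hypothetical flat pixel layout, for illustration only
Logits = List[float]  # next-token logits over a toy vocabulary


def make_noise_like(image: Image, rng: random.Random) -> Image:
    """Return Gaussian noise with the same length as the given image."""
    return [rng.gauss(0.0, 1.0) for _ in image]


def focus_logits(
    images: Sequence[Image],
    logits_fn: Callable[[List[Image]], Logits],  # hypothetical model call
    alpha: float = 1.0,                          # assumed contrast strength
    seed: int = 0,
) -> Logits:
    """Sketch of one FOCUS-style decoding step for a multi-image input."""
    rng = random.Random(seed)
    noise = [make_noise_like(img, rng) for img in images]

    # One pass per image: keep that image clean, replace the rest with noise.
    per_image = []
    for keep in range(len(images)):
        context = [img if i == keep else noise[i] for i, img in enumerate(images)]
        per_image.append(logits_fn(context))

    # Assumption: aggregate the partially masked logits with a plain mean.
    count = len(per_image)
    aggregated = [sum(column) / count for column in zip(*per_image)]

    # Noise-only reference pass, used to contrast away image-independent signal.
    reference = logits_fn(noise)

    # Assumption: standard contrastive-decoding style refinement.
    return [(1 + alpha) * a - alpha * r for a, r in zip(aggregated, reference)]
```

Under these assumptions, one decoding step costs one forward pass per image plus one noise-only pass, so an input with N images needs N + 1 calls to `logits_fn`.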
