The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation
Hoin Jung, Xiaoqian Wang

TL;DR
This paper investigates how external context can cause multimodal models to fail and introduces BAIR, a method to improve their reliability by addressing visual and positional biases during inference.
Contribution
It formalizes the recorruption phenomenon in multimodal models and proposes BAIR, a parameter-free inference technique that mitigates the visual and positional biases driving it and improves model grounding.
Findings
BAIR restores visual saliency and improves grounding in multimodal models.
The method enhances diagnostic reliability across medical, social, and geospatial benchmarks.
It does not require retraining or fine-tuning of models.
Abstract
While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass and sharpness, and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct…
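As a rough illustration of the quantities named in the abstract, the Python sketch below counts recorruption at the instance level and summarizes one attention row over the visual tokens. The function names, the entropy-based sharpness proxy, and the toy inputs are assumptions made for illustration only; this summary does not give the paper's definitions of attention mass and sharpness, and they may differ.

```python
import math
from typing import Sequence, Tuple


def recorruption_rate(
    labels: Sequence[str],
    preds_no_context: Sequence[str],
    preds_with_context: Sequence[str],
) -> float:
    """Share of initially correct instances that turn wrong once context is added.

    Hypothetical helper: an instance counts as "recorrupted" when the model is
    right without retrieved context and wrong with it.
    """
    initially_correct = 0
    flipped = 0
    for y, p0, p1 in zip(labels, preds_no_context, preds_with_context):
        if p0 == y:
            initially_correct += 1
            if p1 != y:
                flipped += 1
    return flipped / initially_correct if initially_correct else 0.0


def visual_attention_stats(
    attn_row: Sequence[float], visual_idx: Sequence[int]
) -> Tuple[float, float]:
    """Return (mass, sharpness) of one attention row over the visual tokens.

    Assumed definitions, not the paper's:
      mass      = fraction of total attention that lands on visual tokens
      sharpness = 1 - normalized entropy of attention within the visual tokens
                  (0 for a uniform spread, 1 when a single token gets it all)
    """
    total = sum(attn_row)
    visual = [attn_row[i] for i in visual_idx]
    visual_sum = sum(visual)
    mass = visual_sum / total if total > 0 else 0.0
    if visual_sum <= 0 or len(visual) < 2:
        return mass, 0.0
    probs = [v / visual_sum for v in visual]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    sharpness = 1.0 - entropy / math.log(len(visual))
    return mass, sharpness


if __name__ == "__main__":
    # Toy data: three instances are correct without context, two of them flip.
    labels = ["a", "b", "c", "d"]
    no_ctx = ["a", "b", "x", "d"]
    with_ctx = ["a", "z", "c", "q"]
    print(recorruption_rate(labels, no_ctx, with_ctx))

    # Tokens 0 and 1 are treated as visual; the rest stand in for text tokens.
    print(visual_attention_stats([0.1, 0.1, 0.4, 0.4], [0, 1]))
```

Comparing these statistics for the same instance with and without retrieved context is one simple way to check whether added text draws attention away from the image, which is the pattern the abstract calls visual blindness.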
