TL;DR
This paper investigates why multimodal large language models (MLLMs) underperform in zero-shot retrieval. It traces the problem to a semantic imbalance in their representations, which are dominated by textual rather than visual semantics, and proposes a simple whitening transformation that improves retrieval without fine-tuning.
Contribution
The study uncovers the semantic imbalance in MLLM representations and introduces ReAlign, a test-time adaptation method that enhances retrieval performance.
Findings
MLLM representations are dominated by textual semantics.
Visual semantics make up only a small portion of the representation space.
ReAlign improves zero-shot retrieval performance across various MLLMs.
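The TL;DR describes the fix as a simple whitening transformation applied without fine-tuning. The sketch below illustrates generic PCA-style whitening on synthetic embeddings. It is not the paper's ReAlign implementation: the function names (`fit_whitening`, `whiten`, `mean_offdiag_cosine`), the `eps` value, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_whitening(embeddings, eps=1e-6):
    """Estimate the mean and a whitening matrix from an (n, d) array."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = centered.T @ centered / embeddings.shape[0]
    # The covariance matrix is symmetric, so eigh is appropriate here.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rescale each principal direction to unit variance.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps))
    return mu, W

def whiten(embeddings, mu, W):
    """Center, whiten, then L2-normalize for cosine-similarity retrieval."""
    z = (embeddings - mu) @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def mean_offdiag_cosine(x):
    """Average cosine similarity between distinct rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

rng = np.random.default_rng(0)
# Synthetic "homogenized" embeddings (an assumption, not real MLLM outputs):
# every vector shares one dominant direction plus a small individual component.
common = rng.normal(size=(1, 16))
emb = 5.0 * common + rng.normal(size=(200, 16))

mu, W = fit_whitening(emb)
white = whiten(emb, mu, W)

before = mean_offdiag_cosine(emb)
after = mean_offdiag_cosine(white)
print(f"mean pairwise cosine before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the raw embeddings are nearly parallel, so cosine similarity barely distinguishes them. After whitening, the shared direction is removed and the remaining variation is spread evenly across dimensions. The statistics here are fit on the same set they are applied to; which data a real test-time method uses to estimate them is a design choice the summary does not specify.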
Abstract
Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from being effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics, while the visual semantics essential for multimodal retrieval constitute only a small portion. We find that this imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal…
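To make the SAE-based analysis mentioned in the abstract more concrete, here is a minimal sketch of a sparse autoencoder forward pass in a common textbook formulation (ReLU encoder, linear decoder, L1 sparsity penalty). The dimensions, the randomly initialized weights, and the helper names (`sae_encode`, `sae_decode`, `sae_loss`, `top_features`) are hypothetical. The paper's actual SAE architecture and training setup are not given in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 32, 128  # hypothetical sizes; the dictionary is overcomplete

# Randomly initialized parameters stand in for a trained SAE (assumption).
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    """Map representations to non-negative feature activations."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(z):
    """Reconstruct representations as a linear combination of decoder rows."""
    return z @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    z = sae_encode(x)
    x_hat = sae_decode(z)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity

def top_features(x, k=5):
    """Indices of the k most active features for each input row."""
    z = sae_encode(x)
    return np.argsort(-z, axis=1)[:, :k]

x = rng.normal(size=(4, d_model))  # placeholder for MLLM output representations
codes = sae_encode(x)
top = top_features(x)
loss = sae_loss(x)
```

With a trained SAE, each feature index would be inspected and labeled (for example as textual or visual), and the share of activation mass per label would indicate how a representation is distributed across concept types.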
