Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Hengyi Feng; Zeang Sheng; Meiyi Qiang; Yang Li; Wentao Zhang

arXiv:2512.19115 · cs.CV · May 12, 2026

TL;DR

This paper investigates why multimodal large language models (MLLMs) underperform at zero-shot multimodal retrieval. It identifies a semantic imbalance in their representations and proposes a simple whitening transformation that improves retrieval performance without fine-tuning.

Contribution

The study uncovers a semantic imbalance in MLLM representations and introduces ReAlign, a test-time adaptation method that improves retrieval performance.

Findings

01. MLLM representations are dominated by textual semantics.

02. Visual semantics make up only a small portion of the representation space.

03. ReAlign improves zero-shot retrieval performance across various MLLMs.
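
The whitening idea mentioned in the TL;DR can be illustrated with a minimal sketch. The code below is a generic PCA-whitening routine in NumPy, not the authors' ReAlign implementation. The function names (`fit_whitening`, `whiten`, `cosine_scores`), the `eps` value, and the choice to estimate statistics from a pool of embeddings are all assumptions made for illustration.

```python
import numpy as np

def fit_whitening(embeddings, eps=1e-6):
    """Estimate a mean and a PCA-whitening matrix from an (n, d) array.

    Hypothetical helper: the paper's listing does not specify how the
    whitening statistics are computed.
    """
    mean = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mean
    cov = centered.T @ centered / (len(embeddings) - 1)
    # eigh is appropriate because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)
    # Scale each eigenvector (column) by 1 / sqrt(eigenvalue).
    W = eigvecs / np.sqrt(eigvals + eps)
    return mean, W

def whiten(embeddings, mean, W):
    """Center the embeddings and rotate/scale them to roughly unit covariance."""
    return (embeddings - mean) @ W

def cosine_scores(queries, candidates):
    """Cosine similarity between every query row and every candidate row."""
    q = queries / np.maximum(np.linalg.norm(queries, axis=1, keepdims=True), 1e-12)
    c = candidates / np.maximum(np.linalg.norm(candidates, axis=1, keepdims=True), 1e-12)
    return q @ c.T
```

In a retrieval setting, one would fit `mean` and `W` on a pool of embeddings, apply `whiten` to both queries and candidates, and rank candidates by `cosine_scores`. Whitening removes the shared dominant directions that make raw embeddings look alike, which is the intuition behind using it when embeddings are homogenized.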

Abstract

Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from being effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics, while the visual semantics essential for multimodal retrieval constitute only a small portion. We find that this imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and ultimately diminishes the discriminative power required for multimodal…
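
To make the SAE decomposition step concrete, here is a minimal sketch of a common ReLU sparse autoencoder forward pass in NumPy. The listing does not describe the authors' SAE architecture, so the formulation, the parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), and the `top_concepts` helper are assumptions for illustration only; real weights would come from a trained SAE rather than from random initialization.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Map (n, d) representations to (n, m) non-negative concept activations."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    """Reconstruct (n, d) representations from (n, m) concept activations."""
    return z @ W_dec + b_dec

def top_concepts(z, k=5):
    """Indices of the k most active concepts per row, strongest first."""
    return np.argsort(-z, axis=-1)[..., :k]
```

With a trained SAE, inspecting which concept indices from `top_concepts` fire most often, and whether they correspond to textual or visual semantics, is one way such an analysis could be approached.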

Code & Models

Repositories

Heinz217/mllm-retrieval-analysis (GitHub)

Models

Heinz217/mllm-retrieval-analysis-sae (Hugging Face model)
