Bottleneck Tokens for Unified Multimodal Retrieval
Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, and Yuchao Zheng

TL;DR
This paper introduces Bottleneck Tokens and Generative Information Condensation to improve unified multimodal retrieval with decoder-only multimodal large language models, achieving state-of-the-art results on the MMEB-V2 benchmark.
Contribution
The paper proposes an explicit pooling mechanism built from learnable bottleneck tokens, together with a training method that gives token-level guidance on how information is compressed into the retrieval embedding.
Findings
Achieves state-of-the-art performance on MMEB-V2, which covers 78 datasets across 3 modalities.
Substantial improvements on semantically demanding tasks like Video-QA.
Efficient inference with negligible overhead over traditional pooling methods.
Abstract
Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., <EOS>) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to…
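The contrast between implicit and explicit pooling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names are hypothetical, the hidden states are hand-written lists rather than MLLM outputs, and averaging the BTok positions into one embedding is an assumption made only for illustration, since the abstract does not say how the BTok states are combined.

```python
def append_bottleneck_tokens(token_embeddings, btok_embeddings):
    """Append the learnable BTok vectors to the end of the input sequence.

    Hypothetical helper: in a real model the BTok vectors would be trainable
    parameters and the combined sequence would pass through the decoder.
    """
    return list(token_embeddings) + list(btok_embeddings)


def implicit_pool(hidden_states):
    """Implicit pooling: reuse the last token's hidden state (e.g. <EOS>)."""
    return list(hidden_states[-1])


def explicit_pool(hidden_states, num_btoks):
    """Explicit pooling: read the embedding from the BTok positions only.

    Assumption: the BTok states are averaged into a single vector.
    """
    btok_states = hidden_states[-num_btoks:]
    dim = len(btok_states[0])
    return [sum(state[d] for state in btok_states) / num_btoks for d in range(dim)]


# Toy "hidden states" standing in for decoder outputs (identity encoder).
sequence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # last row plays the role of <EOS>
btoks = [[0.5, 0.5], [1.5, 2.5]]                  # two bottleneck tokens

extended = append_bottleneck_tokens(sequence, btoks)
eos_embedding = implicit_pool(sequence)
btok_embedding = explicit_pool(extended, num_btoks=len(btoks))
```

In this sketch the embedding capacity is fixed by the number of BToks rather than by whatever a single vocabulary token happens to carry, which is the structural point the abstract makes about explicit pooling.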
