Bottleneck Tokens for Unified Multimodal Retrieval

Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, and Yuchao Zheng

arXiv:2604.11095·cs.LG·April 14, 2026

TL;DR

This paper introduces Bottleneck Tokens and Generative Information Condensation to improve unified multimodal retrieval with decoder-only multimodal large language models (MLLMs), achieving state-of-the-art results on MMEB-V2 (78 datasets, 3 modalities).

Contribution

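The paper proposes an explicit pooling mechanism built on a small set of learnable Bottleneck Tokens (BToks), together with a training objective, Generative Information Condensation, that gives token-level guidance on how information is compressed into the embedding.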

Findings

01  Achieves state-of-the-art performance on MMEB-V2 with 78 datasets and 3 modalities.

02  Substantial improvements on semantically demanding tasks like Video-QA.

03  Efficient inference with negligible overhead over traditional pooling methods.

Abstract

Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., <EOS>) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to…
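
To make the two components concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract is cut off before it says what the Condensation Mask blocks, so the sketch assumes a sequence layout of [input tokens | BToks | target tokens] in which target tokens may not attend directly to input tokens and must therefore read the input through the BToks. Mean-pooling the BTok hidden states and L2-normalizing the result is likewise an assumption, and the function names `build_condensation_mask` and `pool_bottleneck` are hypothetical.

```python
import math


def build_condensation_mask(n_input, n_btok, n_target):
    """Return a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumed layout: [input | BToks | target]. The mask is causal, and, as an
    assumption (the source abstract is truncated here), target tokens are
    additionally blocked from attending to the input tokens.
    """
    n = n_input + n_btok + n_target
    target_start = n_input + n_btok
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            allowed = k <= q  # standard causal constraint
            if q >= target_start and k < n_input:
                allowed = False  # assumed severed path: target -> input
            row.append(allowed)
        mask.append(row)
    return mask


def pool_bottleneck(hidden, n_input, n_btok):
    """Mean-pool the BTok hidden states and L2-normalize (pooling choice is an assumption).

    `hidden` is a list of per-token vectors in the same layout as the mask.
    """
    btok = hidden[n_input:n_input + n_btok]
    dim = len(btok[0])
    mean = [sum(vec[d] for vec in btok) / n_btok for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


# Toy usage: 3 input tokens, 2 BToks, 2 target tokens, hidden size 2.
mask = build_condensation_mask(3, 2, 2)
hidden = [[float(i), float(i + 1)] for i in range(7)]
embedding = pool_bottleneck(hidden, 3, 2)
```

Under these assumptions the BToks are the only route from input to target during training, which is one way to read "fixed-capacity explicit pooling": the embedding size depends on the number of BToks and the hidden dimension, not on the input length.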
