ReMatch: Boosting Representation through Matching for Multimodal Retrieval
Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson

TL;DR
ReMatch introduces a novel multimodal retrieval framework that leverages the generative capabilities of MLLMs through end-to-end training and generative matching, achieving state-of-the-art results and strong zero-shot generalization.
Contribution
It proposes an end-to-end training approach in which an MLLM-based generative matching stage supplements the standard contrastive loss to improve multimodal retrieval performance.
Findings
Achieves a new state of the art on the MMEB benchmark.
Demonstrates strong zero-shot generalization across five datasets.
Utilizes multiple learnable tokens for richer embeddings.
Abstract
We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual,…
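The abstract describes a matching objective that complements a standard contrastive loss. The following is a minimal sketch of how two such terms might be combined, not the authors' implementation. Everything in it is an assumption made for illustration: the InfoNCE form of the contrastive term, the binary cross-entropy form of the matching term, the `temperature` and `match_weight` parameters, and the function names. In ReMatch the relevance decision is produced autoregressively by the MLLM; here `p_yes` is simply passed in as a list of probabilities.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, docs, temperature=0.05):
    """In-batch contrastive loss (assumed form): docs[i] is the positive for queries[i]."""
    q = [normalize(x) for x in queries]
    d = [normalize(x) for x in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, dj) / temperature for dj in d]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def matching_loss(p_yes, labels):
    """Binary cross-entropy on the matcher's probability of judging a pair relevant.

    p_yes is a hypothetical stand-in for the probability the MLLM assigns
    to a positive relevance answer for each (query, document) pair.
    """
    eps = 1e-12
    total = 0.0
    for p, y in zip(p_yes, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(labels)


def combined_loss(queries, docs, p_yes, labels, match_weight=1.0):
    """Contrastive term plus weighted matching term (the weighting is an assumption)."""
    return info_nce(queries, docs) + match_weight * matching_loss(p_yes, labels)
```

In this sketch, a hard negative that sits close to the query in embedding space contributes little to the contrastive term once the positive dominates the softmax, while a confident wrong "yes" on that same pair is still penalized heavily by the per-pair matching term. That is one way to read the abstract's claim about stronger gradients on hard negatives.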
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
