MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
Geonmo Gu, Byeongho Heo, Jaemyung Yu, Jaehui Hwang, Taekyung Kim, Sangmin Lee, HeeJae Jun, Yoohoon Kang, Sangdoo Yun, Dongyoon Han

TL;DR
MuCo introduces a multi-turn contrastive learning framework for multimodal embedding that improves training efficiency and representation coherence by processing related query-target pairs within a shared context.
Contribution
The paper presents MuCo, a multi-turn contrastive learning method that uses the conversational context of MLLMs to improve both multimodal embeddings and training efficiency.
Findings
Achieves state-of-the-art retrieval performance on MMEB and M-BEIR benchmarks.
Significantly improves training efficiency compared to single-turn contrastive methods.
Enhances representation coherence across modalities.
Abstract
Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite their empirical success, these models are primarily built on a "single-turn" formulation in which each query-target pair is treated as an independent data point. This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships among multiple queries that relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple related query-target pairs associated with a single image within a single forward pass. This…
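To make the contrast between the single-turn and multi-turn formulations concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`info_nce`, `multi_turn_loss`), the temperature value, and the choice to treat every other pair in the batch as a negative (including pairs from the same image) are all assumptions made for illustration. The embeddings are random stand-ins for what an MLLM would emit at each turn.

```python
import numpy as np

def info_nce(queries, targets, temperature=0.07):
    """Standard InfoNCE: the positive for query i is target i; all other targets are negatives."""
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=-1, keepdims=True)
    logits = q @ t.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def multi_turn_loss(turn_queries, turn_targets, temperature=0.07):
    """Hypothetical multi-turn variant: inputs have shape (B, K, D), i.e. K
    query-target turns per image. In a multi-turn setup the K embeddings for
    one image would come from one forward pass; here we only flatten them
    into B*K pairs and apply InfoNCE (an assumption, not the paper's stated loss)."""
    b, k, d = turn_queries.shape
    return info_nce(turn_queries.reshape(b * k, d),
                    turn_targets.reshape(b * k, d),
                    temperature)

rng = np.random.default_rng(0)
B, K, D = 4, 3, 16  # 4 images, 3 turns each, 16-dim embeddings (toy sizes)
targets = rng.normal(size=(B, K, D))
queries = targets + 0.01 * rng.normal(size=(B, K, D))  # nearly aligned pairs

loss_aligned = multi_turn_loss(queries, targets)
loss_random = multi_turn_loss(rng.normal(size=(B, K, D)), targets)
num_pairs = B * K  # 12 contrastive pairs from what would be 4 forward passes
print(loss_aligned, loss_random, num_pairs)
```

Under these assumptions, a batch of `B` images yields `B * K` contrastive pairs, whereas a single-turn setup would need `B * K` separate forward passes to produce the same number of pairs. How the actual method handles same-image negatives or context sharing between turns is not specified in the excerpt above.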
