MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model

Geonmo Gu; Byeongho Heo; Jaemyung Yu; Jaehui Hwang; Taekyung Kim; Sangmin Lee; HeeJae Jun; Yoohoon Kang; Sangdoo Yun; Dongyoon Han

arXiv:2602.06393·cs.IR·April 3, 2026

TL;DR

MuCo introduces a multi-turn contrastive learning framework for multimodal embedding that improves training efficiency and representation coherence by processing related query-target pairs within a shared context.

Contribution

The paper presents MuCo, a multi-turn contrastive learning method that uses the conversational context of MLLMs to improve both the quality of multimodal embeddings and training efficiency.

Findings

1. Achieves state-of-the-art retrieval performance on MMEB and M-BEIR benchmarks.
2. Significantly improves training efficiency compared to single-turn contrastive methods.
3. Enhances representation coherence across modalities.

Abstract

Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite their empirical success, these models are primarily built on a "single-turn" formulation in which each query-target pair is treated as an independent data point. This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships between multiple queries that relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple related query-target pairs associated with a single image within a single forward pass. This…
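
The contrast between the single-turn and multi-turn formulations can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `(image_id, query, target)` triple layout, the temperature value, and the use of a plain in-batch InfoNCE loss are all assumptions made for illustration, and the counting below simplifies a "forward pass" to one pass per training sample.

```python
import math
from collections import defaultdict


def group_pairs_by_image(pairs):
    """Group (image_id, query, target) triples by image.

    Hypothetical helper: each group stands for one multi-turn sample,
    i.e. several query-target turns that share the same image context.
    """
    grouped = defaultdict(list)
    for image_id, query, target in pairs:
        grouped[image_id].append((query, target))
    return dict(grouped)


def forward_pass_counts(pairs):
    """Return (single_turn, multi_turn) sample counts.

    Single-turn: one pass per query-target pair.
    Multi-turn (assumed here): one pass per image, covering all its pairs.
    """
    return len(pairs), len(group_pairs_by_image(pairs))


def info_nce(queries, targets, temperature=0.05):
    """Generic InfoNCE loss with in-batch negatives.

    queries[i] and targets[i] are a positive pair; every other target in
    the batch is a negative. Vectors must be non-zero. This is a standard
    contrastive loss, not necessarily the exact objective used by MuCo.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, t_j)) / temperature for t_j in t]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Illustrative data only: three questions about one image, one about another.
pairs = [
    ("img_a", "q1", "t1"),
    ("img_a", "q2", "t2"),
    ("img_a", "q3", "t3"),
    ("img_b", "q4", "t4"),
]
single_turn, multi_turn = forward_pass_counts(pairs)
print(single_turn, multi_turn)
```

In this toy setting the four pairs collapse into two image-level samples, which is the source of the efficiency gain the abstract describes. How the turns are embedded and which negatives each turn sees are not specified in this excerpt, so the loss above treats every extracted embedding as an ordinary in-batch example.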

Code & Models

Repositories

naver-ai/muco (GitHub)

Datasets

naver-ai/M3T (dataset, 545 downloads)
