MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li

TL;DR
MMEmb-R1 introduces an adaptive, reasoning-enhanced multimodal embedding framework that selectively employs reasoning to improve alignment and efficiency, achieving state-of-the-art results with reduced latency.
Contribution
MMEmb-R1 proposes an adaptive reasoning approach that combines pair-aware selection with reinforcement learning, addressing structural misalignment and unnecessary reasoning in multimodal embedding.
Findings
Achieves a score of 71.2 on the MMEB-V2 benchmark with 4B parameters.
Reduces reasoning overhead and inference latency significantly.
Establishes new state-of-the-art in multimodal embedding tasks.
Abstract
MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual…
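To make "pairwise contrastive supervision" concrete, here is a minimal sketch of a standard in-batch InfoNCE loss in plain Python. It is an illustration only: the abstract does not specify the exact loss MMEmb-R1 uses, and the function names (`l2_normalize`, `info_nce`) and the temperature value are assumptions chosen for this example. The point it shows is that supervision is defined over query-target pairs, whereas a reasoning trace is generated per instance, which is the structural mismatch the abstract describes.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.05):
    """Mean in-batch InfoNCE loss (hypothetical helper, not the paper's code).

    queries[i] is the positive match for targets[i]; every other target in
    the batch serves as a negative for queries[i].
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_denom - logits[i]
    return total / len(q)


# Toy usage: correctly paired embeddings give a lower loss than swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(aligned, aligned)
loss_swapped = info_nce(aligned, swapped)
```

In this sketch the loss depends only on the final embeddings of each pair, so nothing in it rewards the content of a reasoning trace directly; that is one way to read the shortcut concern raised in the abstract.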
