Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

Haonan Jiang; Yuji Wang; Yongjie Zhu; Xin Lu; Wenyu Qin; Meng Wang; Pengfei Wan; and Yansong Tang

arXiv:2602.13823 · cs.CV · March 13, 2026

Open Access

TL;DR

This paper introduces Embed-RL, a reinforcement learning framework that enhances multimodal embeddings by integrating reasoning traces aligned with retrieval tasks, leading to improved cross-modal understanding and performance.

Contribution

We propose a reasoning-driven Universal Multimodal Embedding (UME) framework in which the Embedder explicitly supervises the Reasoner, the generated reasoning extracts multimodal evidence, and performance improves on benchmark datasets.

Findings

1. Outperforms existing models on MMEB-V2 and UVRB benchmarks.
2. Enhances cross-modal semantic consistency and fine-grained matching.
3. Demonstrates the effectiveness of reasoning optimization in multimodal embeddings.

Abstract

Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the reasoning CoTs generated by existing generative embedding methods are limited to textual analysis of the queries and are irrelevant to retrieving the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT…
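
The idea of an Embedder supervising a Reasoner can be illustrated with a toy reward signal. The sketch below is not the paper's method: the function names (`cosine`, `embedder_reward`), the softmax-over-candidates reward, the temperature value, and the two-dimensional vectors are all assumptions made for illustration. It only shows one plausible way that embedding similarity between a query-plus-reasoning representation and candidate targets could be turned into a scalar reward for a reasoning trace.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedder_reward(query_cot_emb, target_embs, positive_idx, temperature=0.05):
    """Hypothetical embedder-derived reward (an assumption, not the paper's).

    Returns the softmax probability that the embedding of the query plus its
    reasoning trace retrieves the positive target among the candidates.
    """
    logits = [cosine(query_cot_emb, t) / temperature for t in target_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    return exps[positive_idx] / sum(exps)


# Toy candidate targets; index 0 is treated as the positive.
targets = [[1.0, 0.0], [0.0, 1.0]]

# A reasoning trace whose embedding lands near the positive target...
good = embedder_reward([0.9, 0.1], targets, positive_idx=0)
# ...versus one whose embedding lands near the negative target.
bad = embedder_reward([0.1, 0.9], targets, positive_idx=0)

print(good > bad)
```

Under these assumptions, a trace that moves the embedding toward the correct target earns a higher reward than one that moves it away, which is the kind of signal a reinforcement learning update for the Reasoner could consume.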

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks