UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

Zhibin Lan; Liqiang Niu; Fandong Meng; Jie Zhou; Jinsong Su

arXiv:2511.00405·cs.LG·March 3, 2026

TL;DR

This paper introduces UME-R1, a framework for generative multimodal embeddings that uses reasoning to outperform traditional discriminative embedding models across diverse tasks.

Contribution

The work pioneers a two-stage training strategy for generative embeddings, combining supervised fine-tuning and reinforcement learning to enhance reasoning and embedding quality.

Findings

01. Generative embeddings outperform discriminative ones in multimodal tasks.

02. Combining discriminative and generative embeddings yields superior performance.

03. Reinforcement learning effectively improves generative embedding quality.

Abstract

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework built on a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The core contribution is novel and well-executed. To my knowledge, this is the first work to systematically incorporate chain-of-thought reasoning into multimodal embedding generation, demonstrating that reasoning-conditioned embeddings can substantially outperform standard discriminative embeddings. The idea of having models generate intermediate reasoning before producing embeddings is intuitive and well-motivated.
2. The two-stage training framework is well-designed and clearly presented.

Weaknesses

1. The terminology and framing claims in the abstract and introduction are too broad and potentially misleading. The paper repeatedly claims to "pioneer the exploration of generative embeddings" and positions UME-R1 as introducing the first "generative" embeddings. However, the term "generative embedding" is overloaded and could reasonably describe several existing approaches. For instance, LamRA and UniIR extract embeddings from generative models during the generation process, which some would

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The integration of generative embeddings into the existing embedding learning paradigm represents a conceptually meaningful extension beyond prior MLLM-based approaches.
- The application of reinforcement learning with verifiable rewards is a well-considered adaptation for optimizing embedding quality.
- Experiments show consistent and notable improvements on the MMEB-V2 benchmark across multiple modalities.

Weaknesses

1. The improvement achieved by the RL stage is marginal (around one point) relative to its additional computational cost.
2. While the oracle setting illustrates the upper bound of the framework, the resulting scores are overly idealized and offer limited practical relevance without an implementable selector.
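
To make the oracle concern concrete, here is a minimal sketch of how such an upper bound could be computed. It assumes "oracle" means a query counts as correct if either the discriminative or the generative embedding retrieves the right candidate. That reading is an assumption, not a detail confirmed on this page. The similarity scores, gold indices, and helper names (`hit_at_1`, `accuracy`) are hypothetical toy values for illustration only.

```python
from typing import List, Sequence


def hit_at_1(scores: Sequence[Sequence[float]], gold: Sequence[int]) -> List[bool]:
    """For each query, check whether the top-scoring candidate is the gold one."""
    hits = []
    for row, gold_idx in zip(scores, gold):
        best = max(range(len(row)), key=lambda j: row[j])
        hits.append(best == gold_idx)
    return hits


def accuracy(flags: Sequence[bool]) -> float:
    """Fraction of queries flagged as correct."""
    return sum(flags) / len(flags)


# Hypothetical query-to-candidate similarity scores (4 queries, 3 candidates).
disc_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.3, 0.2],
]
gen_scores = [
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
]
gold = [0, 2, 1, 1]  # hypothetical index of the correct candidate per query

disc_hits = hit_at_1(disc_scores, gold)
gen_hits = hit_at_1(gen_scores, gold)

# Assumed oracle: a query is correct if EITHER embedding mode is correct.
# This needs the gold labels to pick the mode, so it cannot be used at inference.
oracle_hits = [d or g for d, g in zip(disc_hits, gen_hits)]

disc_acc = accuracy(disc_hits)
gen_acc = accuracy(gen_hits)
oracle_acc = accuracy(oracle_hits)
print(disc_acc, gen_acc, oracle_acc)
```

On this toy data each mode alone is right on half the queries, while the oracle is right on three of four. The gap shows why an oracle score is an upper bound: closing it in practice would require a selector that chooses a mode without access to the gold labels.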

Reviewer 03 · Rating 6 · Confidence 1

Strengths

1. Generative embeddings are worth studying and have certain commonalities for multimodal tasks. This work explores a reasoning-driven method that is brave and novel.
2. The experimental results are solid, supported by a newly constructed cold-start supervised fine-tuning dataset for embedding training with intermediate reasoning and summaries.
3. Convincing visualization examples are provided to demonstrate the effectiveness of the proposed method.

Weaknesses

1. The paper lacks a comparison framework between traditional multimodal embeddings and the proposed generative embeddings. Providing one would improve the presentation.
2. The visual examples require more detailed explanations; otherwise, they may seem complex, and it is difficult to determine the more efficient aspects of the proposed method when viewing them directly.

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling