Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

Peixi Wu; Ke Mei; Feipeng Ma; Bosong Chai; Zhibin Lan; Chenxi Zhao; Shannan Yan; Jie Chen; Zhangchi Hu; Yansong Peng; Bo Lin; Junjie Zhou; Dacheng Yin; Tianyi Wang; Fengyun Rao; Jing Lyu; Hebei Li; Xiaoyan Sun

arXiv:2604.22280 · cs.CV · April 27, 2026

TL;DR

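This paper introduces RIME (Rewrite-driven Multimodal Embedding), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. It targets the redundant thinking steps and ambiguous summarized answers of Chain-of-Thought reasoning, and it aligns the generative and discriminative embedding spaces so that retrieval can trade off efficiency and accuracy.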

Contribution

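The paper proposes RIME, a framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. It also introduces Cross-Mode Alignment (CMA), which bridges the generative and discriminative embedding spaces, and Refine Reinforcement Learning (Refine-RL), to improve multimodal retrieval performance.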

Findings

01

RIME outperforms previous generative embedding models on multiple benchmarks.

02

RIME significantly shortens the generated reasoning compared to Chain-of-Thought.

03

Experiments show improved retrieval accuracy and efficiency.

Abstract

Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that…
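The "flexible mutual retrieval" mentioned in the abstract presupposes that generative and discriminative embeddings can be compared in one space. The sketch below only illustrates that general idea and is not the paper's CMA method: it assumes, hypothetically, that both modes already produce vectors of the same dimension, and it ranks candidates embedded in one mode against a query embedded in the other by cosine similarity. The function names and the toy vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, c), i) for i, c in enumerate(candidate_vecs)]
    return [i for _, i in sorted(scored, key=lambda s: -s[0])]


# Hypothetical toy data: the query comes from the (slower) generative mode,
# the candidates from the (faster) discriminative mode. Cross-mode ranking
# is only meaningful if the two spaces have been aligned beforehand.
generative_query = [0.9, 0.1]
discriminative_candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

ranking = rank_candidates(generative_query, discriminative_candidates)
print(ranking)
```

In such a setup, the efficiency and accuracy trade-off would come from choosing which side (query, candidates, or both) pays the cost of generation before embedding; how RIME makes that choice is not stated in the truncated abstract.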
