Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

Qi Li; Yanzhe Zhao; Yongxin Zhou; Yameng Wang; Yandong Yang; Yuanjia Zhou; Jue Wang; Zuojian Wang; Jinxiang Liu

arXiv:2602.05275·cs.CV·February 6, 2026

TL;DR

Magic-MM-Embedding introduces an efficient multimodal embedding model that reduces computational costs while achieving state-of-the-art performance through a multi-stage training strategy.

Contribution

The paper presents a novel MLLM architecture with visual token compression and a progressive, multi-stage training paradigm to improve efficiency and performance in multimodal retrieval.

Findings

01. Outperforms existing methods in accuracy and efficiency

02. Reduces inference latency and memory footprint significantly

03. Achieves state-of-the-art results on multimodal retrieval benchmarks

Abstract

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. However, their practical application is often hindered by the substantial computational cost of processing the large number of tokens produced by visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed to not only recover but significantly boost performance. This coarse-to-fine training paradigm begins with extensive continued pretraining…
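
The abstract does not say which compression mechanism the models use, so the following is only a generic illustration of why visual token compression lowers cost, not the authors' method. It assumes the visual tokens form a row-major 2D grid and merges each square window by average pooling; the function name `pool_visual_tokens` and all of its parameters are hypothetical.

```python
from typing import List

Vector = List[float]


def pool_visual_tokens(
    tokens: List[Vector], grid_h: int, grid_w: int, factor: int
) -> List[Vector]:
    """Average-pool a row-major grid of visual tokens.

    Each factor x factor window of tokens is replaced by its mean, so the
    token count shrinks by factor**2. This is an assumed, generic scheme,
    not the compression module described in the paper.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    if grid_h % factor or grid_w % factor:
        raise ValueError("grid sides must be divisible by factor")

    dim = len(tokens[0])
    window = factor * factor
    pooled: List[Vector] = []
    for top in range(0, grid_h, factor):
        for left in range(0, grid_w, factor):
            # Sum every token in the current window, then divide by its size.
            acc = [0.0] * dim
            for r in range(top, top + factor):
                for c in range(left, left + factor):
                    tok = tokens[r * grid_w + c]
                    for d in range(dim):
                        acc[d] += tok[d]
            pooled.append([v / window for v in acc])
    return pooled


# Toy input: a 4 x 4 grid of 2-dimensional tokens, pooled with factor 2.
grid = [[float(i), 1.0] for i in range(16)]
compressed = pool_visual_tokens(grid, 4, 4, 2)
print(len(grid), "->", len(compressed))
```

In this toy case 16 tokens become 4. Since the language model processes every visual token it receives, a shorter visual sequence is what reduces latency and memory; the abstract presents the multi-stage training as the way to recover, and then improve on, the accuracy that compression could otherwise cost.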

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Graph Neural Networks