MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

Haohang Huang; Xuan Lu; Mingyi Su; Xuan Zhang; Ziyan Jiang; Ping Nie; Kai Zou; Tomas Pfister; Wenhu Chen; Wei Zhang; Xiaoyu Shen; Rui Meng

arXiv:2604.23321 · cs.IR · April 28, 2026

TL;DR

This paper introduces MMEB-V3, a comprehensive benchmark for evaluating full-modality embeddings across text, images, videos, and audio, revealing current models' limitations in modality-aware retrieval.

Contribution

The work presents the MMEB-V3 benchmark and the OmniSET diagnostic set, which together enable systematic evaluation and diagnosis of full-modality embedding models. Existing benchmarks cover only a subset of modalities and do not support this.

Findings

01. Models often fail to retrieve the target modality.

02. Cross-modal retrieval is highly asymmetric and query-modality biased.

03. Instruction-induced shifts are insufficient or misaligned with target modalities.
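
To make the first two findings concrete, the following is a minimal sketch of how a target-modality hit rate could be measured on a candidate pool that holds the same content in several modalities. It is not the paper's evaluation code: the function name `target_modality_hit_rate`, the toy 3-dimensional embeddings, and the choice of top-1 cosine retrieval are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def target_modality_hit_rate(queries, candidates, cand_modalities, target_modality):
    """Fraction of queries whose top-1 candidate belongs to target_modality.

    Hypothetical helper, not taken from the paper.
    """
    sims = l2_normalize(queries) @ l2_normalize(candidates).T
    top1 = sims.argmax(axis=1)
    return float(np.mean(np.asarray(cand_modalities)[top1] == target_modality))

# Toy data (assumed): dims 0-1 carry semantics, dim 2 carries a modality offset.
# Two semantic items, each present once as text and once as an image.
candidates = np.array([
    [1.0, 0.0,  1.0],   # item 0, text
    [1.0, 0.0, -1.0],   # item 0, image
    [0.0, 1.0,  1.0],   # item 1, text
    [0.0, 1.0, -1.0],   # item 1, image
])
modalities = ["text", "image", "text", "image"]

# Text queries for items 0 and 1; the intended target modality is "image".
queries = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
])

image_rate = target_modality_hit_rate(queries, candidates, modalities, "image")
text_rate = target_modality_hit_rate(queries, candidates, modalities, "text")
```

In this toy setup the modality offset dominates the similarity, so both text queries retrieve the text copy of the right item and `image_rate` is 0.0 while `text_rate` is 1.0. Holding the semantics fixed across modalities is what lets the modality effect be separated from semantic similarity.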

Abstract

Multimodal embedding models aim to map heterogeneous inputs, such as text, images, videos, and audio, into a shared semantic space. However, existing methods and benchmarks remain largely limited to partial modality coverage, making it difficult to systematically evaluate full-modality representation learning. In this work, we take a step toward the full-modality setting. We introduce MMEB-V3, a comprehensive benchmark that evaluates embeddings across text, image, video, audio, as well as agent-centric scenarios. To enable more fine-grained diagnosis, we further construct OmniSET (Omni-modality Semantic Equivalence Tuples), where semantically equivalent instances are represented across modalities, allowing us to disentangle semantic similarity from modality effects. Through experiments on MMEB-V3, we conduct a systematic analysis of full-modality embeddings and identify three key…
