MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
Haohang Huang, Xuan Lu, Mingyi Su, Xuan Zhang, Ziyan Jiang, Ping Nie, Kai Zou, Tomas Pfister, Wenhu Chen, Wei Zhang, Xiaoyu Shen, Rui Meng

TL;DR
This paper introduces MMEB-V3, a comprehensive benchmark for evaluating full-modality embeddings across text, images, videos, and audio, revealing current models' limitations in modality-aware retrieval.
Contribution
The work presents MMEB-V3 and OmniSET, enabling systematic evaluation and diagnosis of full-modality embedding models, addressing a gap in existing benchmarks.
Findings
Models often fail to retrieve the target modality.
Cross-modal retrieval is highly asymmetric and query-modality biased.
Instruction-induced shifts are insufficient or misaligned with target modalities.
Abstract
Multimodal embedding models aim to map heterogeneous inputs, such as text, images, videos, and audio, into a shared semantic space. However, existing methods and benchmarks remain largely limited to partial modality coverage, making it difficult to systematically evaluate full-modality representation learning. In this work, we take a step toward the full-modality setting. We introduce MMEB-V3, a comprehensive benchmark that evaluates embeddings across text, image, video, and audio, as well as agent-centric scenarios. To enable more fine-grained diagnosis, we further construct OmniSET (Omni-modality Semantic Equivalence Tuples), where semantically equivalent instances are represented across modalities, allowing us to disentangle semantic similarity from modality effects. Through experiments on MMEB-V3, we conduct a systematic analysis of full-modality embeddings and identify three key…
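One way to picture how semantically equivalent tuples can separate semantic similarity from modality effects is the sketch below. It is illustrative only and is not the paper's metric or code: the function name top1_modality_matrix, the array layout (items x modalities x dimensions), and the use of top-1 cosine retrieval are all assumptions made for this example. For each query modality, it reports which modality the nearest neighbor comes from when every item is available in every modality.

```python
import numpy as np

def top1_modality_matrix(emb: np.ndarray) -> np.ndarray:
    """Hypothetical diagnostic, not taken from the paper.

    emb has shape (n_items, n_modalities, dim); emb[i, m] is the embedding of
    semantic item i rendered in modality m. Returns R with shape
    (n_modalities, n_modalities), where R[q, t] is the fraction of queries in
    modality q whose top-1 neighbor (excluding the query itself) is in modality t.
    """
    n_items, n_mod, dim = emb.shape
    # Flatten in C order, so flat index = item * n_mod + modality.
    flat = emb.reshape(n_items * n_mod, dim)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = flat @ flat.T                    # cosine similarity between all pairs
    np.fill_diagonal(sims, -np.inf)         # a query may not retrieve itself
    nearest = sims.argmax(axis=1)
    query_mod = np.arange(n_items * n_mod) % n_mod
    hit_mod = nearest % n_mod
    counts = np.zeros((n_mod, n_mod))
    np.add.at(counts, (query_mod, hit_mod), 1)
    return counts / n_items                 # each row sums to 1

# Toy data (assumed): each embedding is an item direction plus a scaled
# modality direction, so `scale` controls how strongly modality dominates.
items = np.eye(7)[:4]       # 4 items
offsets = np.eye(7)[4:7]    # 3 modalities
emb = items[:, None, :] + 10.0 * offsets[None, :, :]
R = top1_modality_matrix(emb)
```

Under these assumptions, a matrix concentrated on the diagonal means queries mostly retrieve their own modality regardless of content, while an asymmetric off-diagonal pattern would correspond to the query-modality bias described in the findings.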
