e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

Haonan Chen; Sicheng Gao; Radu Timofte; Tetsuya Sakai; Zhicheng Dou

arXiv:2601.03666 · cs.CL · January 12, 2026

TL;DR

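e5-omni introduces an explicit alignment approach for omni-modal embeddings. It targets three problems that hinder cross-modal comparison: similarity scores on inconsistent scales across modalities, in-batch negatives that lose effectiveness during training, and mismatched embedding statistics between modalities.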

Contribution

The paper proposes a lightweight, explicit alignment method for omni-modal embeddings, aiming for more robust and consistent cross-modal comparison than the implicit alignment inherited from pretrained VLM backbones.

Findings

01. Consistent performance improvements on MMEB-V2 and AudioCaps datasets.

02. Effective transferability of the alignment recipe to various VLM backbones.

03. Enhanced stability and comparability of cross-modal similarity scores.

Abstract

Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these…
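
Issue (iii) in the abstract, mismatched first- and second-order statistics across modalities, can be made concrete with a small diagnostic. The sketch below is not the paper's method (the abstract is cut off before the method is described). It only measures the gap between the mean vectors and covariance matrices of two sets of embeddings. The function names (`mean_vector`, `covariance`, `stats_gap`) and the synthetic "text" and "audio" embeddings are assumptions made for illustration.

```python
import math
import random


def mean_vector(rows):
    """Per-dimension mean (first-order statistic)."""
    n, d = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(d)]


def covariance(rows):
    """Sample covariance matrix (second-order statistic)."""
    n, d = len(rows), len(rows[0])
    mu = mean_vector(rows)
    centered = [[r[j] - mu[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def stats_gap(a, b):
    """Return (mean gap, covariance gap) between two embedding sets.

    The mean gap is the Euclidean distance between the mean vectors; the
    covariance gap is the Frobenius norm of the covariance difference.
    """
    mu_a, mu_b = mean_vector(a), mean_vector(b)
    mean_gap = math.sqrt(sum((x - y) ** 2 for x, y in zip(mu_a, mu_b)))
    cov_a, cov_b = covariance(a), covariance(b)
    d = len(cov_a)
    cov_gap = math.sqrt(
        sum((cov_a[i][j] - cov_b[i][j]) ** 2 for i in range(d) for j in range(d))
    )
    return mean_gap, cov_gap


# Synthetic stand-ins for two modalities (assumed data, not real model outputs).
rng = random.Random(0)
text_emb = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
audio_emb = [[rng.gauss(0.5, 2.0) for _ in range(4)] for _ in range(200)]

mean_gap, cov_gap = stats_gap(text_emb, audio_emb)
print(f"mean gap: {mean_gap:.3f}, covariance gap: {cov_gap:.3f}")
```

When both gaps are near zero, the two sets occupy a similar region of the shared space with a similar spread. Large gaps suggest that similarity scores computed across the two modalities may not be directly comparable.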

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning