mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Haonan Chen; Liang Wang; Nan Yang; Yutao Zhu; Ziliang Zhao; Furu Wei; Zhicheng Dou

arXiv:2502.08468·cs.CV·February 13, 2025

TL;DR

This paper introduces mmE5, a multimodal multilingual embedding model trained on high-quality synthetic data covering diverse tasks and modalities. The model reaches state-of-the-art results on multimodal and multilingual benchmarks.

Contribution

The work presents a data synthesis approach guided by three quality criteria (broad scope, robust cross-modal alignment, and high fidelity), which yields diverse, aligned, and realistic synthetic datasets for training multimodal multilingual embedding models.

Findings

01. mmE5 achieves state-of-the-art results on the MMEB benchmark.
02. Synthetic data quality significantly impacts embedding performance.
03. High-quality synthetic datasets improve multilingual multimodal understanding.

Abstract

Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide…
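
To make the idea of a unified representation space concrete, here is a minimal sketch of cross-modal retrieval: once text and images are embedded into the same space, a text query can be matched to images by cosine similarity. The 3-dimensional vectors, file names, and helper functions below are hypothetical illustrations, not outputs or APIs of mmE5; a real model would produce much higher-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: assume a text encoder and an image encoder
# already mapped these inputs into one shared 3-d space.
text_query = [0.9, 0.1, 0.0]  # e.g. the query "a photo of a dog"
image_candidates = {
    "dog_photo.jpg": [0.8, 0.2, 0.1],
    "city_skyline.jpg": [0.0, 0.3, 0.9],
    "cat_photo.jpg": [0.5, 0.7, 0.1],
}

ranking = rank_candidates(text_query, image_candidates)
best_match = ranking[0][0]

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because both modalities live in one space, the same ranking function would serve image-to-text or text-to-text retrieval without modification; only the encoders that produce the vectors differ.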

Code & Models

Repositories

haon-chen/mme5 (PyTorch, official)

Models

intfloat/mmE5-mllama-11b-instruct (Hugging Face)
model · 58 dl · ♡ 20

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need