E5-V: Universal Embeddings with Multimodal Large Language Models

Ting Jiang; Minghui Song; Zihan Zhang; Haizhen Huang; Weiwei Deng; Feng Sun; Qi Zhang; Deqing Wang; Fuzhen Zhuang

arXiv:2407.12580·cs.CL·July 18, 2024·1 cites


TL;DR

E5-V introduces a universal multimodal embedding framework leveraging large language models, achieving high performance across tasks with minimal multimodal training data and significantly reduced costs.

Contribution

The paper presents E5-V, a novel framework that enables universal multimodal embeddings using only text-based training, reducing costs and data requirements while surpassing previous methods.

Findings

01. E5-V outperforms traditional multimodal models in various tasks.

02. Training on only text data reduces costs by approximately 95%.

03. E5-V achieves or surpasses state-of-the-art results across multiple multimodal tasks.

Abstract

Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in representing multimodal inputs compared to previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training…
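The single modality training described above uses text pairs only. As a rough illustration, a standard in-batch contrastive (InfoNCE) objective over paired sentence embeddings could look like the sketch below. This is an assumption about the general shape of such an objective, not the authors' confirmed implementation; the function name `info_nce_loss` and the temperature value are hypothetical, and the embeddings are stand-ins for whatever the MLLM would produce.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss over paired embeddings (illustrative sketch).

    anchors, positives: arrays of shape (batch, dim); row i of each is a pair.
    Every other row in the batch acts as a negative for row i.
    The temperature default is a hypothetical value, not taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    logits = a @ p.T / temperature                       # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Row-wise log-softmax; the diagonal holds the matching pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))  # stand-in for sentence embeddings
    print("aligned pairs:   ", info_nce_loss(emb, emb))
    print("mismatched pairs:", info_nce_loss(emb, emb[::-1]))
```

Correctly paired rows should give a much lower loss than mismatched rows, which is the signal such training would exploit.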

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. A new multimodal representation framework leveraging the multimodal understanding capability of MLLMs, with a clear motivation and a simple yet effective method.
2. Strong performance on text-image retrieval and composed image retrieval tasks with text-only training data, eliminating the need for costly multimodal training data and computing.

Weaknesses

There are no major weaknesses. Just some minor weaknesses and a few questions in need of clarification. Minor weaknesses: 1. The authors provide a visualization to illustrate the "modality gap". It would be better to provide a quantitative metric (such as https://openreview.net/pdf?id=S7Evzt9uit3) to evaluate the gap between image embeddings and text embeddings from MLLMs w/o prompt-based representation. 2. Could you clarify whether the visualization of Figure 3(b) is before or after single
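One simple way to put a number on the "modality gap" the reviewer mentions is the distance between the centroids of the normalized image embeddings and the normalized text embeddings. The sketch below assumes that definition; it is one common choice, not necessarily the metric in the linked reference, and `modality_gap` is a hypothetical helper name. The random arrays stand in for real embeddings.

```python
import numpy as np

def modality_gap(image_emb, text_emb):
    """Euclidean distance between the centroids of two embedding sets.

    image_emb, text_emb: arrays of shape (n, dim) and (m, dim).
    Embeddings are L2-normalized first, so the result lies in [0, 2].
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(32, 8))
    # Same distribution: gap is zero. Shifted distribution: gap is large.
    print("no gap:     ", modality_gap(shared, shared))
    print("shifted set:", modality_gap(shared, shared + 5.0))
```

Computing this once with and once without the prompt-based representation would give the before/after comparison the reviewer asks for.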

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper is well written. I could understand their motivation, the proposed method, and the experimental results.
2. The proposed method is simple but effective. If we have a good pretrained MLLM, we can easily fine-tune it for multimodal representations. This method can be applied to other combinations of two or more modalities as well.
3. A trained embedding space is free from the modality gap issue, one of the important issues of contrastive learning-based multimodal models.
4. The author

Weaknesses

- I acknowledge that the proposed training method is efficient if we have a good enough pretrained MLLM and that a trained embedding space is free from the modality gap. These properties are excellent. However, I doubt if the comparative experiments are fair in terms of model size. The model size of E5-V (LLaVA-Next-8B) is 8B, which is larger than CLIP, BLIP, and even EVA-CLIP used for comparison. As shown in the [EVA-CLIP paper](https://arxiv.org/abs/2303.15389), the model size affects its perf

Reviewer 03 · Rating 6 · Confidence 3

Strengths

Building on an MLLM lets E5-V handle interleaved inputs and train with a small batch size. Training on text-only data yields better performance than training on multimodal data. The "summary XXX in one word" prompt effectively bridges the modality gap between text and vision.

Weaknesses

The "summary XXX in one word" prompt enables the model to handle simple image or text inputs. If the input complexity increases, a single word cannot represent the input text or image well. For a new multimodal task, a new prompt must be designed to ensure effective performance.
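To make the "summary XXX in one word" idea concrete, here is a minimal sketch of how per-modality prompts might be assembled. The exact template wording, the `<image>` placeholder, and the helper name `build_prompt` are assumptions for illustration; the real prompts and image token depend on the underlying MLLM. The point is that every modality ends in the same instruction, so an embedding could be read from the same position regardless of input type.

```python
# Hypothetical placeholder marking where image features would be spliced in;
# the actual token depends on the underlying MLLM.
IMAGE_PLACEHOLDER = "<image>"

# Assumed template wording, modeled on the "summary XXX in one word" prompt.
# Supporting a new task means adding a new entry here.
TEMPLATES = {
    "text": "{content}\nSummary of the above sentence in one word:",
    "image": "{content}\nSummary of the above image in one word:",
}

def build_prompt(modality, text=None):
    """Return the embedding prompt for a text or image input."""
    if modality not in TEMPLATES:
        raise ValueError(f"no prompt template for modality: {modality!r}")
    if modality == "text":
        if text is None:
            raise ValueError("text input requires a sentence")
        content = text
    else:
        content = IMAGE_PLACEHOLDER
    return TEMPLATES[modality].format(content=content)


if __name__ == "__main__":
    print(build_prompt("text", "A dog runs on the beach."))
    print(build_prompt("image"))
```

Keeping the templates in one table reflects the reviewer's point: each new multimodal task would need its own entry, and a one-word target may be too coarse for complex inputs.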

Code & Models

Repositories

kongds/e5-v
pytorch · Official

Models

🤗 royokong/e5-v
model · 16k dl · ♡ 31


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques