EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Junyi Chen; Longteng Guo; Jia Sun; Shuai Shao; Zehuan Yuan; Liang Lin; Dongyu Zhang

arXiv:2308.11971·cs.CV·March 4, 2024



Open Access

TL;DR

EVE is a scalable, efficient vision-language model that uses a unified Transformer with modality-aware MoE modules and masked signal modeling to achieve state-of-the-art results with faster training.

Contribution

EVE introduces a unified pre-training framework with modality-aware sparse MoE modules and masked signal modeling, simplifying and accelerating vision-language pre-training.
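A modality-aware sparse MoE layer of the kind named above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the class `ModalityAwareMoE`, the top-1 routing rule, the idea of adding a per-modality embedding to the router input, and all sizes are assumptions chosen to keep the example small.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


class ModalityAwareMoE:
    """Toy top-1 sparse MoE layer; hypothetical, not the EVE implementation."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.router = rand_matrix(num_experts, dim)
        # Assumption: one embedding per modality, added to the router input
        # so that routing decisions can depend on the token's modality.
        self.modality_emb = {
            "image": [rng.gauss(0.0, 1.0) for _ in range(dim)],
            "text": [rng.gauss(0.0, 1.0) for _ in range(dim)],
        }
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, token, modality):
        """Return (expert index, gate probability) for one token."""
        router_in = [t + m for t, m in zip(token, self.modality_emb[modality])]
        probs = softmax(matvec(self.router, router_in))
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, probs[best]

    def forward(self, token, modality):
        """Run only the selected expert and scale its output by the gate."""
        best, gate = self.route(token, modality)
        expert_out = matvec(self.experts[best], token)
        return [gate * o for o in expert_out], best


if __name__ == "__main__":
    layer = ModalityAwareMoE(dim=4, num_experts=3)
    token = [0.5, -0.2, 0.1, 0.9]
    for modality in ("image", "text"):
        _, expert = layer.forward(token, modality)
        print(modality, "-> expert", expert)
```

Because only one expert runs per token, compute per token stays roughly constant as experts are added; that is the general appeal of sparse MoE layers, independent of how EVE configures them.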

Findings

01. 3.5x faster pre-training than a model pre-trained with Image-Text Contrastive-based objectives

02. State-of-the-art performance on vision-language tasks

03. Effective scaling with fewer resources

Abstract

Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text…
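The masked signal modeling objective described in the abstract can be illustrated with a minimal sketch: hide a random subset of image patches and text tokens, then score reconstructions only at the hidden positions. The function names, the mean-squared-error pixel loss, the cross-entropy token loss, and the mask ratios in the demo are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def random_mask(n, ratio, rng):
    """Return n booleans with round(n * ratio) positions masked (True)."""
    k = round(n * ratio)
    masked = set(rng.sample(range(n), k))
    return [i in masked for i in range(n)]


def masked_pixel_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    errs = [
        sum((p - t) ** 2 for p, t in zip(pred_patch, tgt_patch)) / len(tgt_patch)
        for pred_patch, tgt_patch, m in zip(pred, target, mask)
        if m
    ]
    return sum(errs) / len(errs) if errs else 0.0


def masked_token_loss(logits, targets, mask):
    """Cross-entropy averaged over masked text tokens only."""
    losses = []
    for row, tgt, m in zip(logits, targets, mask):
        if not m:
            continue
        mx = max(row)
        log_z = mx + math.log(sum(math.exp(v - mx) for v in row))
        losses.append(log_z - row[tgt])
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sizes: 16 patches of 3 values each, 8 tokens, vocabulary of 5.
    patches = [[rng.random() for _ in range(3)] for _ in range(16)]
    tokens = [rng.randrange(5) for _ in range(8)]
    img_mask = random_mask(16, 0.75, rng)   # illustrative ratio
    txt_mask = random_mask(8, 0.5, rng)     # illustrative ratio
    # Stand-in predictions: zeros for pixels, uniform logits for tokens.
    pixel_pred = [[0.0] * 3 for _ in patches]
    token_logits = [[0.0] * 5 for _ in tokens]
    total = (masked_pixel_loss(pixel_pred, patches, img_mask)
             + masked_token_loss(token_logits, tokens, txt_mask))
    print("combined masked loss:", total)
```

Both losses ignore visible positions, so the model is rewarded only for recovering what it could not see, which is the common thread that lets one objective cover pixels and tokens alike.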


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections · Absolute Position Encodings · Residual Connection