Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Yehao Li, Jiahao Fan, Yingwei Pan, Ting Yao, Weiyao Lin, and Tao Mei

arXiv:2201.04026·cs.CV·January 12, 2022·1 cites

Open Access

TL;DR

Uni-EDEN is a versatile Transformer-based model pre-trained on multi-granular vision-language tasks, enabling effective perception and generation in various vision-language applications with strong generalizability.

Contribution

The paper introduces Uni-EDEN, a universal encoder-decoder network that jointly learns multi-modal perception and generation through multi-granular pre-training tasks, unlike existing single-encoder models.

Findings

01. Effective across multiple vision-language tasks

02. Strong generalization after fine-tuning

03. Outperforms previous models in perception and generation

Abstract

Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span different granularities in this hierarchy including, from simple to…
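The three-module layout described in the abstract can be illustrated with a rough, parameter-free sketch. This is not the authors' implementation: the function names (`attend`, `encode`, `two_stream_forward`), the single attention pass per module, the absence of learned projections, and the toy 4-dimensional inputs are all assumptions made for illustration. It only shows the data flow: each modality is encoded separately, and the decoder stand-in lets sentence tokens attend to the encoded objects.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention with no learned projections (a simplification)."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def encode(features):
    """Hypothetical stand-in for a single-modality encoder: one self-attention pass."""
    return attend(features, features, features)


def two_stream_forward(object_feats, word_embs):
    """Hypothetical forward pass: two separate encoders, then cross-modal attention."""
    objects = encode(object_feats)  # object encoder stream
    words = encode(word_embs)       # sentence encoder stream
    # Decoder stand-in: each word queries the encoded objects (inter-modal interaction).
    return attend(words, objects, objects)


# Toy inputs (made up): three detected objects and two sentence tokens.
object_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
word_embs = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
fused = two_stream_forward(object_feats, word_embs)
print(len(fused), len(fused[0]))
```

The result holds one fused vector per sentence token, with the same dimensionality as the object features. A real model of this kind would add learned projections, multiple heads and layers, and a causal mask in the decoder for generation; none of those details are taken from the source.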

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization · Dense Connections · Multi-Head Attention · Softmax · Byte Pair Encoding · Label Smoothing