ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation

Han Zhang; Weichong Yin; Yewei Fang; Lanxin Li; Boqiang Duan; Zhihua Wu; Yu Sun; Hao Tian; Hua Wu; Haifeng Wang

arXiv:2112.15283 · cs.CV · January 3, 2022 · 30 cites

TL;DR

ERNIE-ViLG introduces a unified transformer-based pre-training framework for bidirectional image-text generation. Pre-trained on a large-scale image-text dataset, it improves performance on both text-to-image synthesis and image captioning.

Contribution

The paper presents ERNIE-ViLG, a unified generative pre-training model for bidirectional vision-language tasks. It formulates both image generation and text generation as autoregressive modeling conditioned on the other modality.

Findings

01. Achieves a state-of-the-art FID of 7.9 on MS-COCO for text-to-image synthesis.

02. Outperforms previous models on the COCO-CN and AIC-ICC datasets for image captioning.

03. Scales to a 10-billion-parameter model trained on 145 million Chinese image-text pairs.

Abstract

Conventional methods for image-text generation mainly tackle the naturally bidirectional generation tasks separately, focusing on designing task-specific frameworks to improve the quality and fidelity of the generated samples. Recently, Vision-Language Pre-training models have greatly improved the performance of image-to-text generation tasks, but large-scale pre-training models for the text-to-image synthesis task are still under-developed. In this paper, we propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with a transformer model. Based on image quantization models, we formulate both image generation and text generation as autoregressive generative tasks conditioned on the text/image input. The bidirectional image-text generative modeling eases the semantic alignments across vision and language. For the text-to-image…
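
The autoregressive formulation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The vocabulary sizes, the separator token SEP, and the helper names (build_example, text_to_image, image_to_text, sequence_nll) are hypothetical, and the integer image codes stand in for the output of some image quantization model. The sketch only shows how both directions can share one sequence format, with the loss applied to the target modality alone.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical vocabulary layout (an assumption for illustration only):
# text ids occupy [0, TEXT_VOCAB); image codes are shifted above them.
TEXT_VOCAB = 1000
IMAGE_VOCAB = 8192
SEP = TEXT_VOCAB + IMAGE_VOCAB  # hypothetical separator token id


@dataclass
class Example:
    tokens: List[int]     # source tokens, separator, then target tokens
    loss_mask: List[int]  # 1 where the token is a prediction target


def build_example(source: List[int], target: List[int]) -> Example:
    """Concatenate source and target; only target positions carry loss."""
    tokens = source + [SEP] + target
    loss_mask = [0] * (len(source) + 1) + [1] * len(target)
    return Example(tokens, loss_mask)


def text_to_image(text_ids: List[int], image_codes: List[int]) -> Example:
    """Text is the condition; quantized image codes are generated."""
    return build_example(text_ids, [TEXT_VOCAB + c for c in image_codes])


def image_to_text(image_codes: List[int], text_ids: List[int]) -> Example:
    """Quantized image codes are the condition; text is generated."""
    return build_example([TEXT_VOCAB + c for c in image_codes], text_ids)


def sequence_nll(
    example: Example,
    next_token_prob: Callable[[List[int], int], float],
) -> float:
    """Sum of -log p(token_i | tokens_<i) over target positions only."""
    total = 0.0
    for i, (tok, is_target) in enumerate(zip(example.tokens, example.loss_mask)):
        if is_target:
            total -= math.log(next_token_prob(example.tokens[:i], tok))
    return total


# Stand-in for a trained model: a uniform distribution over all token ids.
def uniform_prob(prefix: List[int], token: int) -> float:
    return 1.0 / (SEP + 1)


t2i = text_to_image([5, 7], [0, 3, 8191])
i2t = image_to_text([0, 3], [5, 7, 9])
print(t2i.tokens, t2i.loss_mask)
print(sequence_nll(i2t, uniform_prob))
```

In this sketch the same sequence model could be trained on both example types, since they differ only in which modality comes first and which positions are masked into the loss. In a real system, uniform_prob would be replaced by the transformer's predicted next-token distribution.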

Taxonomy

Topics: Multimodal Machine Learning Applications