XGPT: Cross-modal Generative Pre-Training for Image Captioning

Qiaolin Xia; Haoyang Huang; Nan Duan; Dongdong Zhang; Lei Ji; Zhifang Sui; Edward Cui; Taroon Bharti; Xin Liu; Ming Zhou

arXiv:2003.01473 · cs.CL · March 5, 2020 · 20 citations

TL;DR

XGPT introduces a novel cross-modal pre-training approach for image captioning, enabling state-of-the-art performance and effective data augmentation for image retrieval tasks.

Contribution

The paper proposes three new generation tasks for pre-training (IMLM, IDA, and TIFG), so that the pre-trained model can be fine-tuned directly for image captioning without task-specific architecture modifications.

Findings

01 Achieves new state-of-the-art results on the COCO Captions and Flickr30k Captions datasets.

02 Improves image retrieval recall metrics through data augmentation.

03 Demonstrates effective cross-modal pre-training for generation tasks.

Abstract

While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate…
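The abstract names Image-conditioned Masked Language Modeling (IMLM) but does not describe its mechanics. As a rough illustration only, the sketch below shows a generic span-masking step of the kind commonly used in masked language modeling: part of a caption is hidden and becomes the prediction target. The function name `mask_span`, the `[MASK]` symbol, and the fixed span position are assumptions made for this example, not the authors' implementation. In an image-conditioned setting the model would also receive image features, which this sketch omits.

```python
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def mask_span(tokens: List[str], start: int, length: int) -> Tuple[List[str], List[str]]:
    """Hide tokens[start:start+length] behind MASK symbols.

    Returns (corrupted, target): the corrupted caption the model would read,
    and the hidden tokens it would be trained to generate.
    """
    if length <= 0 or start < 0 or start + length > len(tokens):
        raise ValueError("span out of range")
    target = tokens[start:start + length]
    corrupted = tokens[:start] + [MASK] * length + tokens[start + length:]
    return corrupted, target


caption = "a dog runs on the beach".split()
corrupted, target = mask_span(caption, start=1, length=2)
print(corrupted)  # ['a', '[MASK]', '[MASK]', 'on', 'the', 'beach']
print(target)     # ['dog', 'runs']
```

A real pre-training pipeline would presumably sample the span at random and pair the corrupted caption with features from the corresponding image; both choices are left out here to keep the example deterministic.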

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: XGPT