Image Captioning In the Transformer Age

Yang Xu; Li Li; Haiyang Xu; Songfang Huang; Fei Huang; Jianfei Cai

arXiv:2204.07374·cs.CV·April 18, 2022·6 cites


Open Access 1 Repo

TL;DR

This paper discusses the transition of image captioning (IC) from CNN-RNN architectures to Transformer-based models, emphasizing end-to-end training and the role of self-supervised learning in improving IC performance.

Contribution

It provides a survey of recent advances in Transformer-based image captioning and analyzes the significance of IC amidst large-scale self-supervised models.

Findings

01. Transformers enable homogeneous, end-to-end trainable IC models.

02. Self-supervised pre-training enhances IC performance.

03. IC maintains unique importance despite large-scale model developments.

Abstract

Image Captioning (IC) has advanced remarkably by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, because CNNs and RNNs do not share a basic network component, such a heterogeneous pipeline is hard to train end-to-end, so the visual encoder learns nothing from the caption supervision. This drawback has motivated researchers to develop a homogeneous architecture that facilitates end-to-end training. The Transformer is well suited to this role: it has proven its potential in both the vision and language domains, and can therefore serve as the basic component of both the visual encoder and the language decoder in an IC pipeline. Meanwhile, self-supervised learning unlocks the power of the Transformer architecture, since a large-scale pre-trained model can be generalized to various tasks, including IC. The success of these large-scale models…


Code & Models

Repositories

sjokerlily/awesome-image-captioning
none · Official


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Dropout · Adam · Multi-Head Attention · Residual Connection · Label Smoothing · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer