Image Captioning In the Transformer Age
Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai

TL;DR
This paper surveys the shift in image captioning (IC) from CNN-RNN architectures to Transformer-based models, emphasizing end-to-end training and the role of self-supervised learning in improving IC performance.
Contribution
It provides a survey of recent advances in Transformer-based image captioning and analyzes the significance of IC amidst large-scale self-supervised models.
Findings
Transformers enable homogeneous, end-to-end trainable IC models.
Self-supervised pre-training enhances IC performance.
IC maintains unique importance despite large-scale model developments.
Abstract
Image Captioning (IC) has made remarkable progress by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, because CNNs and RNNs do not share a basic network component, such a heterogeneous pipeline is hard to train end-to-end, so the visual encoder learns nothing from the caption supervision. This drawback has inspired researchers to develop a homogeneous architecture that facilitates end-to-end training. The Transformer is a natural fit: it has proven its potential in both the vision and language domains and can therefore serve as the basic component of both the visual encoder and the language decoder in an IC pipeline. Meanwhile, self-supervised learning unlocks the power of the Transformer architecture, so that a large-scale pre-trained model can be generalized to various tasks, including IC. The success of these large-scale models…
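As a rough illustration of what "sharing the basic network component" can mean, the sketch below applies one scaled dot-product attention function twice: once as self-attention over image-patch features (the visual encoder side) and once as cross-attention from caption tokens to the encoded patches (the language decoder side). It is a minimal sketch and not the paper's model: the toy vectors, the function names, and the absence of learned projections, multiple heads, and causal masking are all simplifying assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Learned query/key/value projections are omitted (assumption) so the
    example stays dependency-free.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Hypothetical toy inputs: 3 image-patch features and 2 caption-token
# embeddings, all of dimension 4.
patches = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

# Visual encoder step: self-attention over the patches.
encoded = attention(patches, patches, patches)

# Language decoder step: cross-attention from caption tokens to encoded patches.
decoded = attention(tokens, encoded, encoded)
```

Because both steps use the same differentiable operation, a real implementation with learned weights could pass gradients from a caption loss back through the decoder into the visual encoder, which is the end-to-end property the abstract describes.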
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Dropout · Adam · Multi-Head Attention · Residual Connection · Label Smoothing · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer
