Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Jun Yu, Jing Li, Zhou Yu, Qingming Huang

TL;DR
This paper introduces a Multimodal Transformer model for image captioning that captures both intra- and inter-modal interactions using a unified attention mechanism, enhanced by multi-view visual features, achieving state-of-the-art results on the MSCOCO dataset.
Contribution
The paper proposes a novel Multimodal Transformer architecture that models intra- and inter-modal interactions simultaneously and integrates multi-view visual features for improved image captioning.
Findings
Outperforms previous state-of-the-art methods on the MSCOCO dataset
Achieves 1st place on the MSCOCO captioning challenge leaderboard
Demonstrates the effectiveness of multi-view features and unified attention in captioning quality
Abstract
Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words based on the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches,…
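The distinction the abstract draws between self-attention (intra-modal) and co-attention (inter-modal) can be illustrated with a minimal sketch of scaled dot-product attention. This is not the authors' implementation: the function names, the toy region and word vectors, and the feature dimension are all assumptions made for illustration, and the sketch omits the learned projections, multiple heads, and residual connections that a full Transformer layer would normally include.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors.

    Each query is compared with every key; the resulting weights
    are used to average the value vectors.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Hypothetical toy inputs: 3 image-region features and 2 word embeddings, dimension 4.
regions = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
words = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

# Intra-modal interaction: each modality attends to itself (self-attention).
regions_sa = attention(regions, regions, regions)
words_sa = attention(words, words, words)

# Inter-modal interaction: words query the image regions (co-attention).
words_ca = attention(words, regions, regions)
```

In this sketch the only difference between the two kinds of interaction is where the keys and values come from: the same modality as the queries for self-attention, the other modality for co-attention.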
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
