Multimodal Transformer with Multi-View Visual Representation for Image Captioning

Jun Yu; Jing Li; Zhou Yu; Qingming Huang

arXiv:1905.07841 · cs.CV · May 21, 2019 · 30 citations

TL;DR

This paper introduces a Multimodal Transformer model for image captioning that captures both intra- and inter-modal interactions using a unified attention mechanism, enhanced by multi-view visual features, achieving state-of-the-art results on the MSCOCO dataset.

Contribution

The paper proposes a novel Multimodal Transformer architecture that models intra- and inter-modal interactions simultaneously and integrates multi-view visual features for improved image captioning.

Findings

1. Outperforms previous state-of-the-art methods on MSCOCO dataset
2. Achieves 1st place on MSCOCO captioning challenge leaderboard
3. Demonstrates the effectiveness of multi-view features and unified attention in captioning quality

Abstract

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words based on the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches,…
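To make the distinction between self-attention (intra-modal) and co-attention (inter-modal) concrete, here is a minimal sketch using plain scaled dot-product attention. It is an illustration only, not the authors' implementation: the function names, the toy feature values, and the feature width of 4 are all assumptions, and the learned projections, multiple heads, and causal masking that a Transformer would use are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy features (made-up numbers): 3 image regions and 2 caption words, width 4.
regions = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
words = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

# Intra-modal (self-attention): each modality attends to itself.
region_ctx = attention(regions, regions, regions)
word_ctx = attention(words, words, words)

# Inter-modal (co-attention): word features query the region features.
word_on_image = attention(word_ctx, region_ctx, region_ctx)
```

The same `attention` function serves both cases; only the source of the queries, keys, and values changes. Here `word_on_image` holds one image-conditioned vector per caption word.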

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax