Meshed-Memory Transformer for Image Captioning

Marcella Cornia; Matteo Stefanini; Lorenzo Baraldi; Rita Cucchiara

arXiv:1912.08226·cs.CV·March 24, 2020

TL;DR

This paper introduces M$^2$, a Meshed-Memory Transformer for image captioning that improves both image encoding and language generation, achieving state-of-the-art results on the COCO dataset.

Contribution

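It presents a Meshed-Memory Transformer that learns a multi-level representation of the relationships between image regions, integrating learned a priori knowledge, and uses mesh-like connectivity at the decoding stage to exploit both low- and high-level features.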

Findings

01

Achieves a new state of the art on the COCO dataset.

02

Outperforms recurrent models in image captioning.

03

Is effective at describing objects unseen during training.

Abstract

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M$^{2}$ - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at the decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M$^{2}$ Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble…
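The two mechanisms named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single attention head, omits biases, layer normalization, and the query/key/value projections in the decoder-side cross-attention, and every name in it (`memory_augmented_attention`, `meshed_cross_attention`, `mem_k`, `mem_v`, `gate_weights`) is hypothetical. The first function appends learned memory rows to the keys and values, which is one way to inject a priori knowledge into region attention; the second attends to every encoder layer and sums the results under sigmoid gates, which is one way to realize mesh-like connectivity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def memory_augmented_attention(regions, w_q, w_k, w_v, mem_k, mem_v):
    """Self-attention over n region features with m extra learned key/value rows.

    regions is n x d; mem_k and mem_v are m x d (assumed shapes).
    """
    d = len(w_q[0])
    queries = matmul(regions, w_q)
    keys = matmul(regions, w_k) + mem_k      # row concatenation: (n + m) x d
    values = matmul(regions, w_v) + mem_v    # row concatenation: (n + m) x d
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights


def meshed_cross_attention(words, encoder_layers, gate_weights):
    """Attend to each encoder layer, then sum the contexts under sigmoid gates.

    words is t x d; encoder_layers holds one n x d matrix per encoder layer;
    gate_weights holds one (2d) x d matrix per layer (assumed, biases omitted).
    """
    d = len(words[0])
    fused = [[0.0] * d for _ in words]
    for layer, w_gate in zip(encoder_layers, gate_weights):
        scores = matmul(words, transpose(layer))
        attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
        context = matmul(attn, layer)
        # The gate sees each word concatenated with its context vector.
        gate_in = [w + c for w, c in zip(words, context)]
        gates = [[1.0 / (1.0 + math.exp(-g)) for g in row]
                 for row in matmul(gate_in, w_gate)]
        for i, (gate_row, ctx_row) in enumerate(zip(gates, context)):
            for j in range(d):
                fused[i][j] += gate_row[j] * ctx_row[j]
    return fused


if __name__ == "__main__":
    rng = random.Random(0)
    n, m, t, d = 4, 2, 3, 8  # regions, memory slots, words, feature size
    regions = random_matrix(n, d, rng)
    w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
    mem_k, mem_v = random_matrix(m, d, rng), random_matrix(m, d, rng)
    layer_1, weights = memory_augmented_attention(regions, w_q, w_k, w_v, mem_k, mem_v)
    words = random_matrix(t, d, rng)
    gate_weights = [random_matrix(2 * d, d, rng) for _ in range(2)]
    fused = meshed_cross_attention(words, [regions, layer_1], gate_weights)
    print(len(layer_1), len(weights[0]), len(fused), len(fused[0]))
```

Each attention row in the first function spans n + m columns, so a region can place weight on memory slots that do not depend on the input image. In the second function the decoder output is a gated sum over all encoder layers rather than a read of the last layer only.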

Code & Models

Videos

Meshed-Memory Transformer for Image Captioning · YouTube

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam