Fusion Models for Improved Visual Captioning

Marimuthu Kalimuthu; Aditya Mogadala; Marius Mosbach; Dietrich Klakow

arXiv:2010.15251 · cs.CV · March 1, 2021

TL;DR

This paper introduces a multimodal fusion framework that integrates pretrained language models into visual captioning systems, improving caption quality and error correction on standard datasets.

Contribution

It proposes a generic fusion framework for combining pretrained language models with visual captioning models and demonstrates its effectiveness in caption emendation tasks.

Findings

01

Improved caption quality on the Flickr8k, Flickr30k, and MSCOCO datasets.

02

Effective syntactic and semantic error correction in captions.

03

Fusion strategies enhance traditional captioning models.

Abstract

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models while also rendering them liable to making mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of our work in this paper is two-fold: First, we propose a generic multimodal model…

Taxonomy

Methods: Linear Layer · Multi-Head Attention · Layer Normalization · WordPiece · Adam · Softmax · Dense Connections · Dropout · Weight Decay