Fusion Models for Improved Visual Captioning
Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

TL;DR
This paper introduces a multimodal fusion framework that integrates pretrained language models into visual captioning systems, improving caption quality and error correction on standard datasets.
Contribution
It proposes a generic fusion framework for combining pretrained language models with visual captioning models and demonstrates its effectiveness in caption emendation tasks.
Findings
Improved caption quality on the Flickr8k, Flickr30k, and MS-COCO datasets.
Effective syntactic and semantic error correction in captions.
Fusion strategies enhance traditional captioning models.
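This summary does not spell out the paper's fusion architecture, so the following is only a generic illustration of how a language model can be combined with a captioning model. It sketches shallow fusion, a simple log-linear interpolation known from speech recognition, and not necessarily the authors' method. The function names, the toy vocabulary, the logits, and the `lm_weight` value are all assumptions made up for the example.

```python
import math


def log_softmax(logits):
    """Convert raw scores to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def shallow_fusion_step(caption_logits, lm_logits, lm_weight=0.3):
    """Pick the next token by adding weighted LM log-probs to captioner log-probs.

    Hypothetical helper: both inputs are per-token scores over the same vocabulary.
    """
    cap_lp = log_softmax(caption_logits)
    lm_lp = log_softmax(lm_logits)
    fused = [c + lm_weight * l for c, l in zip(cap_lp, lm_lp)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Toy example (made-up numbers): the captioner slightly prefers the
# misspelled "dig", while the language model strongly prefers "dog".
vocab = ["dog", "dig", "runs"]
caption_logits = [2.0, 2.1, 0.0]
lm_logits = [3.0, -1.0, 0.5]

best, fused = shallow_fusion_step(caption_logits, lm_logits, lm_weight=0.5)
print(vocab[best])
```

With `lm_weight=0.0` the captioner's own choice is returned unchanged, so the weight controls how strongly the language model is allowed to override the visual model.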
Abstract
Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models while also rendering them liable to make mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of our work in this paper is two-fold: First, we propose a generic multimodal model…
Taxonomy
Methods: Linear Layer · Multi-Head Attention · Layer Normalization · WordPiece · Adam · Softmax · Dense Connections · Dropout · Weight Decay
