Learning Distinct and Representative Styles for Image Captioning

Qi Chen; Chaorui Deng; Qi Wu

arXiv:2209.08231 · cs.CV · August 16, 2023 · 1 citation

TL;DR

This paper introduces a Discrete Mode Learning (DML) paradigm for image captioning. It learns a set of mode embeddings from the training caption corpus and uses them to control the mode of generated captions, which enhances diversity and informativeness and addresses the mode collapse problem in current methods.

Contribution

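The paper proposes a DML framework with a dual architecture: an image-conditioned discrete variational autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC) branch, which together learn mode embeddings and use them for diverse caption generation.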

Findings

01. Improved caption diversity and quality on the MSCOCO dataset
02. Successful application to Transformer and AoANet models
03. Addresses mode collapse in image captioning

Abstract

Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption that only captures the most general mode (a.k.a. language pattern) in the training corpus, i.e., the so-called mode collapse problem. As a result, the generated captions are limited in diversity and usually less informative than natural image descriptions made by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and further use them to control the mode of the generated captions for existing image captioning models. Specifically, the…
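
The abstract is cut off before it describes the mechanism, so the following is only an illustrative sketch of what "a set of mode embeddings" used as a control signal could look like, not the authors' implementation. It assumes two things the text above does not state: that a caption representation is assigned to a mode by nearest-neighbour lookup in a codebook, and that the chosen mode embedding is added to each word embedding before decoding. The names `make_codebook`, `nearest_mode`, and `condition_on_mode` are hypothetical.

```python
import random


def make_codebook(num_modes, dim, seed=0):
    """Build a toy codebook of `num_modes` mode embeddings of size `dim`.

    In a real system these vectors would be learned; random values are
    used here only so the sketch runs on its own.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_modes)]


def nearest_mode(vector, codebook):
    """Return the index of the codebook entry closest to `vector`.

    Assumption: squared Euclidean distance is the assignment rule.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def condition_on_mode(word_embeddings, codebook, mode_id):
    """Add the selected mode embedding to every word embedding.

    Assumption: the control signal is injected by element-wise addition.
    """
    mode = codebook[mode_id]
    return [[w + m for w, m in zip(word, mode)] for word in word_embeddings]


if __name__ == "__main__":
    codebook = make_codebook(num_modes=4, dim=3)
    caption_vector = [0.1, -0.2, 0.3]           # stand-in for an encoded caption
    mode_id = nearest_mode(caption_vector, codebook)
    words = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]  # stand-in word embeddings
    print(mode_id, condition_on_mode(words, codebook, mode_id))
```

Under these assumptions, decoding the same image with a different `mode_id` changes only the added vector, which is one simple way a fixed captioning model could be steered toward different language patterns.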

Code & Models

Repositories

bladewaltz1/modecap (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Residual Connection · Dense Connections