Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Zihao Yue; Anwen Hu; Liang Zhang; Qin Jin

arXiv:2306.13460 · cs.CL · October 31, 2023 · 1 citation

TL;DR

This paper introduces SMILE, a training method for image captioning that encourages models to generate more detailed, descriptive captions by allowing richness optimization while blocking conciseness optimization.

Contribution

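The paper proposes Semipermeable MaxImum Likelihood Estimation (SMILE), a training approach that allows richness optimization while blocking conciseness optimization, so the model is no longer pushed toward more concise wording when its prediction is richer than the label.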

Findings

01. SMILE significantly improves caption richness on the MSCOCO and Flickr30K datasets.

02. Models trained with SMILE produce longer, more detailed captions.

03. Extensive experiments validate the effectiveness of SMILE in enhancing caption quality.

Abstract

Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than labels lead to richness optimization. Such conflicting optimization directions could eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus…
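The contrast between the two optimization directions can be made concrete with a small sketch. The abstract excerpt above is truncated and does not spell out the mechanism, so the code below rests on an explicit assumption: that conciseness optimization can be blocked by computing the softmax only over the tokens that occur in the reference caption. Under that assumption, probability mass placed on a richer word outside the label no longer raises the loss, while mass placed on a later label word (a more concise prediction) still does. The function names `mle_loss` and `subset_mle_loss` and the toy vocabulary are hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits, allowed=None):
    """Log-probabilities over the indices in `allowed` (all indices if None)."""
    idx = range(len(logits)) if allowed is None else sorted(allowed)
    m = max(logits[i] for i in idx)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in idx))
    return {i: logits[i] - log_z for i in idx}


def mle_loss(step_logits, target):
    """Standard token-level negative log-likelihood over the full vocabulary."""
    total = sum(log_softmax(l)[t] for l, t in zip(step_logits, target))
    return -total / len(target)


def subset_mle_loss(step_logits, target):
    """NLL with the softmax restricted to tokens that occur in the target.

    Assumption (not stated in the abstract excerpt): this restriction is one
    way to allow richness optimization while blocking conciseness optimization.
    """
    allowed = set(target)
    total = sum(log_softmax(l, allowed)[t] for l, t in zip(step_logits, target))
    return -total / len(target)


# Toy vocabulary: 0="a", 1="dog", 2="brown", 3="runs"; label: "a dog runs".
target = [0, 1, 3]
# At step 2 the model prefers the richer word "brown", which is not in the label.
step_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]

print("full-vocabulary MLE loss:", mle_loss(step_logits, target))
print("subset-restricted loss:  ", subset_mle_loss(step_logits, target))
```

In this toy setup the restricted loss is independent of the logit for "brown", so the model is not pushed away from the richer word. A prediction of "runs" at step 2, which would skip "dog" and shorten the caption, is still penalized because "runs" belongs to the label.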

Videos

Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization