Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation
Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin

TL;DR
This paper introduces SMILE, a training objective for image captioning that encourages more detailed, descriptive captions by allowing richness optimization while blocking conciseness optimization.
Contribution
The paper proposes Semipermeable MaxImum Likelihood Estimation (SMILE), a training objective that permits richness optimization while blocking conciseness optimization, so that the model learns to produce more descriptive captions.
Findings
SMILE significantly improves caption richness on the MSCOCO and Flickr30K datasets.
Models trained with SMILE produce longer, more detailed captions.
Extensive experiments validate the effectiveness of SMILE in enhancing caption quality.
Abstract
Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than labels lead to richness optimization. Such conflicting optimization directions could eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus…
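The contrast between the two optimization directions can be illustrated with a toy per-token loss. The abstract as shown here does not say how the blocking is implemented, so the sketch below rests on a loudly stated assumption: the softmax is restricted to the tokens that occur in the reference caption. The function names, the three-word vocabulary, and the logit values are all hypothetical and chosen only for illustration; this is not presented as the authors' implementation.

```python
import math

def log_softmax_over(logits, indices):
    """Log-probabilities of a softmax computed only over `indices`."""
    indices = list(indices)
    m = max(logits[i] for i in indices)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in indices))
    return {i: logits[i] - log_z for i in indices}

def mle_loss(logits, target):
    """Standard MLE term: cross-entropy over the full vocabulary."""
    return -log_softmax_over(logits, range(len(logits)))[target]

def subset_loss(logits, target, subset):
    """Assumed 'semipermeable' term: cross-entropy over `subset` only."""
    assert target in subset
    return -log_softmax_over(logits, sorted(subset))[target]

# Toy vocabulary (hypothetical): 0 = "dog", 1 = "runs", 2 = "fluffy"

# Scenario A: label is "dog runs"; the model leans toward the richer
# word "fluffy", which does not occur in the label.
subset_a, target_a = {0, 1}, 0
base_a = [2.0, 0.0, 0.0]
richer = [2.0, 0.0, 4.0]
mle_a = (mle_loss(base_a, target_a), mle_loss(richer, target_a))
sub_a = (subset_loss(base_a, target_a, subset_a),
         subset_loss(richer, target_a, subset_a))

# Scenario B: label is "fluffy dog runs"; the model leans toward "dog",
# skipping the detail that the label contains.
subset_b, target_b = {0, 1, 2}, 2
base_b = [0.0, 0.0, 2.0]
concise = [4.0, 0.0, 2.0]
sub_b = (subset_loss(base_b, target_b, subset_b),
         subset_loss(concise, target_b, subset_b))

print("A, full-vocabulary MLE:", mle_a)   # loss rises for the richer word
print("A, subset loss:        ", sub_a)   # loss is unchanged
print("B, subset loss:        ", sub_b)   # loss rises for the concise word
```

Under this assumption, probability mass placed on a word outside the reference caption does not change the restricted loss, so nothing pushes the model back toward the shorter phrasing. A prediction that skips a word present in the reference is still penalized, which keeps the pressure toward richer output.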
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
