Discriminability objective for training descriptive captions
Ruotian Luo, Brian Price, Scott Cohen, Gregory Shakhnarovich

TL;DR
This paper introduces a discriminability-focused training objective for image captioning models, significantly enhancing their ability to produce captions that distinguish between images, while also improving traditional caption quality metrics.
Contribution
It proposes a novel loss component that directly optimizes for caption discriminability, applicable across various captioning models and loss functions.
Findings
Humans find the generated captions more discriminative.
Standard caption quality scores like BLEU and SPICE improve.
The method is modular and broadly applicable.
Abstract
One property that remains lacking in image captions generated by contemporary methods is discriminability: being able to tell two images apart given the caption for one of them. We propose a way to improve this aspect of caption generation. By incorporating into the captioning training objective a loss component directly related to the ability (of a machine) to disambiguate image/caption matches, we obtain systems that produce much more discriminative captions, according to human evaluation. Remarkably, our approach also leads to improvement in other aspects of generated captions, reflected by a battery of standard scores such as BLEU and SPICE. Our approach is modular and can be applied to a variety of model/loss combinations commonly proposed for image captioning.
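The abstract describes adding a discriminability term to an existing captioning loss but does not give its exact form here. The following is a minimal sketch of one way such a combination could look, assuming a max-margin retrieval loss over a batch similarity matrix. The function names, the margin, the weight, and the similarity scores are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import List


def retrieval_hinge_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Mean max-margin retrieval loss over a batch (illustrative assumption).

    sim[i][j] is a hypothetical similarity score between image i and
    caption j; matching pairs lie on the diagonal. Requires at least
    two image/caption pairs so that each pair has a mismatched rival.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two image/caption pairs")
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Most similar mismatched caption for image i.
        hardest_caption = max(sim[i][j] for j in range(n) if j != i)
        # Most similar mismatched image for caption i.
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        # Penalize whenever a mismatch comes within `margin` of the true pair.
        total += max(0.0, margin + hardest_caption - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total / n


def combined_objective(caption_loss: float,
                       discriminability_loss: float,
                       weight: float = 1.0) -> float:
    """Add a weighted discriminability term to any captioning loss."""
    return caption_loss + weight * discriminability_loss


if __name__ == "__main__":
    distinct = [[0.9, 0.1], [0.2, 0.8]]    # captions clearly match their own image
    confusable = [[0.5, 0.5], [0.5, 0.5]]  # captions fit both images equally well
    print(retrieval_hinge_loss(distinct))
    print(retrieval_hinge_loss(confusable))
    print(combined_objective(2.0, retrieval_hinge_loss(confusable), weight=0.5))
```

Because the discriminability term is simply added to whatever captioning loss is already in use, the same pattern applies regardless of the underlying model, which is the modularity the abstract refers to. In this sketch, captions that fit several images equally well incur a positive penalty, while clearly distinguishing captions incur none.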
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
