Fluent and Accurate Image Captioning with a Self-Trained Reward Model
Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR
This paper introduces Self-Cap, a novel image captioning method that uses a self-trained, contrastive reward model to generate more descriptive and accurate captions, overcoming limitations of traditional and CLIP-based rewards.
Contribution
The paper presents a learnable, contrastive reward model trained with self-generated negatives. Used as the fine-tuning reward, it improves caption quality and shortens fine-tuning relative to optimizing hand-crafted metrics such as CIDEr.
Findings
Enhanced caption descriptiveness and semantic richness.
Reduced fine-tuning time compared to CIDEr-based methods.
Effective on both standard and zero-shot datasets.
Abstract
Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a…
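To make the idea of a contrastive reward trained on self-generated negatives concrete, here is a minimal sketch. It assumes an InfoNCE-style objective, which this excerpt does not confirm as the authors' loss. The function names, the temperature value, and the toy 2-D vectors standing in for image and caption embeddings are all illustrative assumptions rather than details from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_reward_loss(image_emb, positive_emb, negative_embs, temperature=0.07):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    positive_emb: embedding of a caption consistent with the image.
    negative_embs: embeddings of captions sampled from the captioner itself
                   (the "self-generated negatives").
    """
    img = normalize(image_emb)
    candidates = [positive_emb] + list(negative_embs)
    logits = [dot(img, normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidate captions.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive caption (index 0).
    return log_z - logits[0]

def reward(image_emb, caption_emb):
    # Hypothetical reward used during captioner fine-tuning:
    # cosine similarity between the image and a sampled caption.
    return dot(normalize(image_emb), normalize(caption_emb))

# Toy embeddings standing in for encoder outputs.
image = [1.0, 0.0]
good_caption = [0.9, 0.1]
self_generated_negatives = [[0.0, 1.0], [0.1, 0.9]]

loss = contrastive_reward_loss(image, good_caption, self_generated_negatives)
```

In this sketch the loss is small when the consistent caption scores higher than the self-generated negatives, and the same similarity score can then serve as the sequence-level reward in place of CIDEr.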
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
