Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Nicholas Moratelli; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara

arXiv:2408.16827 · cs.CV · September 2, 2024

TL;DR

This paper introduces Self-Cap, a novel image captioning method that uses a self-trained, contrastive reward model to generate more descriptive and accurate captions, overcoming limitations of traditional and CLIP-based rewards.

Contribution

The paper presents a learnable, contrastive reward model trained with self-generated negatives, improving caption quality and reducing fine-tuning time compared to traditional metrics.

Findings

1. Enhanced caption descriptiveness and semantic richness.
2. Reduced fine-tuning time compared to CIDEr-based methods.
3. Effective on both standard and zero-shot datasets.

Abstract

Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. Conversely, recent attempts to employ image-text models like CLIP as a reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a…
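The idea of a contrastive reward model trained with self-generated negatives can be illustrated with a small, generic sketch. The abstract excerpt above does not specify the actual training objective, how negatives are produced, or how the reward is computed, so everything below is an assumption: the InfoNCE-style loss, the cosine-similarity reward, the temperature of 0.07, the toy three-dimensional embeddings, and the helper names `cosine`, `contrastive_loss`, and `reward` are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image, positive, negatives, temperature=0.07):
    """InfoNCE-style loss (an assumed objective, not the paper's).

    The positive caption should score higher against the image than
    every negative caption; the loss approaches 0 when it does.
    """
    logits = [cosine(image, positive) / temperature]
    logits += [cosine(image, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidate captions.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


def reward(image, caption):
    """Assumed reward: image-caption similarity from the tuned model."""
    return cosine(image, caption)


# Toy embeddings (hypothetical): a reference caption close to the image
# and two captions sampled from the captioner that serve as negatives.
image = [1.0, 0.0, 0.0]
reference = [0.9, 0.1, 0.0]
self_generated = [[0.1, 0.9, 0.0], [0.0, 0.2, 0.9]]

good_loss = contrastive_loss(image, reference, self_generated)
bad_loss = contrastive_loss(image, self_generated[0], [reference])
```

In this toy setup the loss is small when the reference caption is treated as the positive and large when a poorly aligned caption takes its place, which is the discriminative behavior a reward model of this kind would need before its scores are used to fine-tune the captioner.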

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training