Fine-grained Image Captioning with CLIP Reward
Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal

TL;DR
This paper introduces a CLIP-based reward for training image captioning models that favors descriptive and distinctive captions and removes the need for reference captions during reward computation.
Contribution
It proposes using CLIP as a reward function for captioning, along with a simple finetuning method for the CLIP text encoder to enhance grammar without extra annotations.
Findings
Generated captions are more distinctive and detailed.
The CLIP-guided model outperforms CIDEr-optimized models in experiments.
Unsupervised grammar finetuning improves caption quality and reduces degeneration.
Abstract
Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe the most salient common objects, models trained with text similarity objectives tend to ignore specific and detailed aspects of an image that distinguish it from others. Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on huge image-text pairs from web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation. This completely eliminates the need for reference captions during the reward computation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained…
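The reward idea in the abstract can be illustrated with a minimal sketch. It assumes image and caption embeddings have already been produced by a CLIP-like encoder pair; here they are plain NumPy arrays, and no real CLIP model is loaded. The function names (`clip_style_reward`, `policy_gradient_loss`) are hypothetical, and the baseline-subtracted policy-gradient loss is an assumed training setup for illustration, not a detail stated in this summary.

```python
import numpy as np


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)


def clip_style_reward(image_emb, caption_emb):
    """Reference-free reward: similarity of each caption to its own image.

    No ground-truth caption is involved; only the image embedding and the
    embedding of the generated caption are compared.
    """
    return cosine_similarity(image_emb, caption_emb)


def policy_gradient_loss(sample_logprob, sample_reward, baseline_reward):
    """REINFORCE-style loss with a baseline (an assumption for illustration).

    sample_logprob:  summed log-probability of each sampled caption
    sample_reward:   reward of each sampled caption
    baseline_reward: reward of a baseline caption (e.g. greedy decoding)
    """
    advantage = sample_reward - baseline_reward
    # Minimizing this raises the likelihood of captions that beat the baseline.
    return float(-np.mean(advantage * sample_logprob))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(4, 8))      # stand-in for image features
    sampled_emb = rng.normal(size=(4, 8))    # stand-in for sampled captions
    greedy_emb = rng.normal(size=(4, 8))     # stand-in for greedy captions
    logprob = -rng.uniform(1.0, 5.0, size=4)

    r_sample = clip_style_reward(image_emb, sampled_emb)
    r_greedy = clip_style_reward(image_emb, greedy_emb)
    print("sample rewards:", r_sample)
    print("loss:", policy_gradient_loss(logprob, r_sample, r_greedy))
```

Because the reward depends only on the image and the generated caption, a caption mentioning details specific to one image can score higher than a generic caption that would fit many images, which is the behavior the abstract is aiming for.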
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
