Fine-grained Image Captioning with CLIP Reward

Jaemin Cho; Seunghyun Yoon; Ajinkya Kale; Franck Dernoncourt; Trung Bui; Mohit Bansal

arXiv:2205.13115 · cs.CL · March 31, 2023

TL;DR

This paper introduces a CLIP-based reward for training image captioning models. The reward favors descriptive, distinctive captions and requires no reference captions during reward computation.

Contribution

It proposes using CLIP as a reward function for captioning, along with a simple finetuning method for the CLIP text encoder to enhance grammar without extra annotations.
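To make the idea of a similarity-based reward concrete, here is a minimal sketch in plain Python. It assumes the image and caption embeddings have already been produced by some CLIP encoder (the toy vectors below are made up), and it assumes a CLIPScore-style scaling `w * max(cos, 0)` with `w = 2.5`. The baseline-subtracted advantage is likewise an assumption about a self-critical-style training setup, not a description of the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_reward(image_emb, text_emb, w=2.5):
    """Reference-free reward: scaled, clipped image-text cosine similarity.

    The scale w = 2.5 is an assumed CLIPScore-style constant.
    """
    return w * max(cosine(image_emb, text_emb), 0.0)


def advantage(image_emb, sampled_emb, baseline_emb):
    """Reward of a sampled caption minus the reward of a baseline caption.

    In a REINFORCE-style update this value would weight the log-probability
    of the sampled caption (assumed setup).
    """
    return clip_reward(image_emb, sampled_emb) - clip_reward(image_emb, baseline_emb)


# Toy embeddings (hypothetical; real CLIP embeddings have hundreds of dimensions).
image = [1.0, 0.0]
detailed_caption = [1.0, 0.0]    # perfectly aligned with the image
generic_caption = [1.0, 1.0]     # partially aligned
unrelated_caption = [-1.0, 0.0]  # negative similarity is clipped to zero

adv = advantage(image, detailed_caption, generic_caption)
```

A positive `adv` means the sampled caption matches the image better than the baseline, so its probability would be pushed up. No reference caption appears anywhere in the computation.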

Findings

01 Generated captions are more distinctive and detailed.

02 In text-to-image retrieval and FineCapEval experiments, the CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.

03 Unsupervised grammar finetuning improves caption quality and reduces degeneration.
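One way to train a grammar signal without extra annotation is to corrupt existing captions and label the results as negatives. The sketch below illustrates that idea only. The three corruption operations (repeat, insert, shuffle), the function names, and the label convention are assumptions made for illustration, not a confirmed description of the paper's procedure.

```python
import random


def corrupt(tokens, rng, vocab):
    """Apply one randomly chosen corruption to a non-empty token list.

    Returns the operation name and the corrupted copy.
    """
    op = rng.choice(["repeat", "insert", "shuffle"])
    out = list(tokens)
    if op == "repeat":
        i = rng.randrange(len(out))
        out.insert(i, out[i])  # duplicate one token in place
    elif op == "insert":
        out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    else:
        rng.shuffle(out)  # note: may occasionally leave the order unchanged
    return op, out


def make_grammar_pairs(captions, seed=0):
    """Build (text, label) pairs: 1 for the original caption, 0 for a corrupted one."""
    rng = random.Random(seed)
    vocab = sorted({tok for cap in captions for tok in cap.split()})
    pairs = []
    for cap in captions:
        pairs.append((cap, 1))
        _, bad = corrupt(cap.split(), rng, vocab)
        pairs.append((" ".join(bad), 0))
    return pairs


pairs = make_grammar_pairs(["a dog runs on grass", "two people ride bikes"])
```

The resulting pairs could feed a small binary classifier on top of a text encoder. Its score would then be combined with the similarity reward to discourage degenerate, repetitive captions.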

Abstract

Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe the most salient common objects, models trained with text similarity objectives tend to ignore specific and detailed aspects of an image that distinguish it from others. Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on huge image-text pairs from web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation. This completely eliminates the need for reference captions during the reward computation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained…

Code & Models

Repositories

j-min/clip-caption-reward
PyTorch · Official

Models

🤗 j-min/CLIP-Caption-Reward
model

Datasets

j-min/FineCapEval
dataset · 44 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training