Distinctive Image Captioning via CLIP Guided Group Optimization
Youyuan Zhang, Jiuniu Wang, Hao Wu, Wenjia Xu

TL;DR
This paper introduces an approach for generating distinctive image captions via CLIP-guided group optimization, so that a caption better differentiates its target image from similar images. The approach is validated with new CLIP-based metrics and extensive experiments.
Contribution
Proposes a simple training strategy using CLIP to enhance caption distinctiveness by comparing images within groups, achieving state-of-the-art results.
Findings
New metrics effectively quantify caption distinctiveness.
The proposed method improves caption uniqueness across various models.
Achieves state-of-the-art performance on distinctiveness benchmarks.
Abstract
Image captioning models are usually trained on human-annotated ground-truth captions, which tends to yield accurate but generic captions. In this paper, we focus on generating distinctive captions that can distinguish the target image from other similar images. To evaluate the distinctiveness of captions, we introduce a series of metrics that use the large-scale vision-language pre-trained model CLIP to quantify distinctiveness. To further improve the distinctiveness of captioning models, we propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap. Extensive experiments are conducted on various baseline models to demonstrate the wide applicability of our strategy and the consistency of metric results with human evaluation. By comparing the performance of our best model with…
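The idea of scoring a caption against its target image and against a group of similar images can be illustrated with a small sketch. This is not the paper's exact metric or loss, which the truncated abstract does not define. It is a minimal illustration that assumes CLIP embeddings for the caption and images have already been computed, and it uses a hypothetical function name, `group_embedding_gap`, for the difference between the caption's cosine similarity to the target image and its mean cosine similarity to the similar images.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def group_embedding_gap(caption_emb, target_img_emb, similar_img_embs):
    """Hypothetical distinctiveness score (an assumption, not the paper's formula).

    caption_emb:      (d,)   text embedding of the caption
    target_img_emb:   (d,)   image embedding of the target image
    similar_img_embs: (k, d) image embeddings of the similar-image group

    Returns cos(caption, target) minus the mean of cos(caption, similar_i).
    A larger value means the caption matches the target more than the group.
    """
    c = l2_normalize(caption_emb)
    t = l2_normalize(target_img_emb)
    g = l2_normalize(similar_img_embs)

    target_sim = float(c @ t)          # similarity to the target image
    group_sim = float(np.mean(g @ c))  # mean similarity to the similar group
    return target_sim - group_sim


# Toy 2-D embeddings standing in for real CLIP features.
distinctive = group_embedding_gap([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [0.0, 2.0]])
generic = group_embedding_gap([1.0, 0.0], [1.0, 0.0], [[2.0, 0.0], [3.0, 0.0]])
```

In this toy setup a caption that matches only the target gets a gap of 1, while a caption that matches the similar images equally well gets a gap of 0. With a real model, the inputs would be the text and image features produced by a CLIP encoder.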
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training
