Distinctive Image Captioning via CLIP Guided Group Optimization

Youyuan Zhang; Jiuniu Wang; Hao Wu; Wenjia Xu

arXiv:2208.04254 · cs.CV · August 30, 2022

Open Access

TL;DR

This paper introduces a novel approach to generate distinctive image captions by leveraging CLIP-guided group optimization, improving caption uniqueness and differentiation from similar images, validated through new metrics and extensive experiments.

Contribution

Proposes a simple training strategy using CLIP to enhance caption distinctiveness by comparing images within groups, achieving state-of-the-art results.

Findings

01. New metrics effectively quantify caption distinctiveness.

02. The proposed method improves caption uniqueness across various models.

03. Achieves state-of-the-art performance on distinctiveness benchmarks.

Abstract

Image captioning models are usually trained on human-annotated ground-truth captions, which tends to yield accurate but generic captions. In this paper, we focus on generating distinctive captions that can distinguish the target image from other similar images. To evaluate the distinctiveness of captions, we introduce a series of metrics that use the large-scale vision-language pre-trained model CLIP to quantify distinctiveness. To further improve the distinctiveness of captioning models, we propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap. Extensive experiments are conducted on various baseline models to demonstrate the wide applicability of our strategy and the consistency of metric results with human evaluation. By comparing the performance of our best model with…
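The abstract does not give the exact form of the group embedding gap, so the following is only a minimal sketch of one plausible reading: score a caption by how much closer its embedding is to the target image than to the average of the similar images. The function names `cosine` and `group_embedding_gap` are hypothetical, and the toy 2-D vectors stand in for embeddings that would in practice come from CLIP's text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_embedding_gap(caption_emb, target_img_emb, similar_img_embs):
    """Assumed score: similarity to the target image minus the mean
    similarity to the images in the similar-image group.

    This is an illustrative definition, not the paper's confirmed formula.
    """
    target_sim = cosine(caption_emb, target_img_emb)
    group_sim = sum(cosine(caption_emb, e) for e in similar_img_embs) / len(
        similar_img_embs
    )
    return target_sim - group_sim


# Toy stand-ins for CLIP embeddings (assumed values, for illustration only).
target_image = [1.0, 0.0]
similar_images = [[0.0, 1.0], [1.0, 1.0]]

distinctive_caption = [1.0, 0.0]  # aligned with the target image only
generic_caption = [1.0, 1.0]      # partly matches every image

gap_distinctive = group_embedding_gap(distinctive_caption, target_image, similar_images)
gap_generic = group_embedding_gap(generic_caption, target_image, similar_images)
```

Under this assumed definition, a caption that matches the target image and the similar images equally well scores near zero, while a caption specific to the target scores higher; a training objective in this spirit would push the gap upward.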

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training