Improving Reference-based Distinctive Image Captioning with Contrastive Rewards

Yangjun Mao; Jun Xiao; Dong Zhang; Meng Cao; Jian Shao; Yueting Zhuang; Long Chen

arXiv:2306.14259·cs.CV·June 27, 2023·2 cites

Open Access

TL;DR

This paper introduces a new contrastive learning module and evaluation metric for reference-based distinctive image captioning, improving the ability to generate unique, accurate captions that highlight image-specific details.

Contribution

It proposes a contrastive learning module integrated into Transformer-based models and introduces new benchmarks and metrics for more effective evaluation of distinctive captioning.

Findings

01

TransDIC++ outperforms state-of-the-art models on new benchmarks

02

The contrastive module enhances the perception of unique image attributes

03

DisCIDEr metric effectively evaluates caption accuracy and distinctiveness

Abstract

Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent DIC method proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to force the generated captions to distinguish between the target image and the reference images. To ensure that Ref-DIC models really perceive the unique objects (or attributes) in target images, we propose two new Ref-DIC benchmarks and develop a Transformer-based Ref-DIC baseline, TransDIC. The model not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Taking one step further, we propose a stronger TransDIC++, which consists of an extra…
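The abstract is cut off before it describes the contrastive component, so the following is only a generic illustration of what "contrasting a caption against reference images" can look like, not the paper's formulation. It assumes a caption has already been scored for similarity against the target image and each reference image (the scores, the function name `contrastive_score`, and the `temperature` parameter are all hypothetical), and returns an InfoNCE-style softmax probability that the caption belongs to the target.

```python
import math


def contrastive_score(target_sim, reference_sims, temperature=0.1):
    """Softmax probability that a caption matches the target image
    rather than any of the reference images.

    Illustrative assumption only: the similarity scores are taken as
    given (e.g. from some image-text matching model), and this is a
    generic InfoNCE-style score, not the method defined in the paper.
    """
    # Put the target first so index 0 is the "positive" entry.
    logits = [target_sim / temperature]
    logits += [s / temperature for s in reference_sims]

    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]

    return exps[0] / sum(exps)


# A caption that fits the target much better than the references
# scores close to 1; a caption that fits all images equally well
# scores 1 / (number of images).
distinctive = contrastive_score(0.9, [0.2, 0.1])
generic = contrastive_score(0.5, [0.5, 0.5])
```

A score of this kind rises only when the caption matches the target image better than it matches the references, which is the intuition behind rewarding distinctiveness rather than accuracy alone.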

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Learning