TL;DR
This paper introduces two new benchmarks, a Transformer-based model called TransDIC, and a new evaluation metric, DisCIDEr, for reference-based distinctive image captioning (Ref-DIC), addressing limitations of previous datasets and models.
Contribution
It proposes stricter benchmarks with object-level similarity control, a strong Transformer-based baseline, and a new metric for more reliable evaluation of distinctive captions.
Findings
TransDIC outperforms existing models on the new benchmarks.
The new benchmarks ensure models perceive unique objects in images.
DisCIDEr effectively evaluates both accuracy and distinctiveness.
Abstract
Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC). The goal is for the generated captions to tell the target image apart from the reference images. Unfortunately, the reference images used by existing Ref-DIC works are easy to distinguish: they resemble the target image only at the scene level and share few common objects, so a Ref-DIC model can trivially generate distinctive captions without even considering the reference images. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks.…
