Group-based Distinctive Image Captioning with Memory Attention
Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan

TL;DR
This paper introduces GdisCap, a Group-based Distinctive Captioning Model whose memory attention emphasizes object features that are unique within a group of similar images, yielding more distinctive and accurate captions.
Contribution
The paper proposes a group-based memory attention module to improve caption distinctiveness and a new evaluation metric, DisWordRate, to measure it.
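The idea behind a distinctive-word metric can be illustrated with a small sketch. This is not the paper's definition of DisWordRate, which is not given on this page. The function name `distinctive_word_rate`, the whitespace tokenization, and the formula itself (the share of words unique to the target image's reference captions that also appear in the generated caption) are assumptions made purely for illustration.

```python
def distinctive_word_rate(generated, target_refs, similar_refs):
    """Illustrative score only (assumed formula, not the paper's DisWordRate).

    generated:    caption produced for the target image
    target_refs:  reference captions of the target image
    similar_refs: reference captions of the other images in the group
    """
    def tokenize(text):
        # Naive lowercase whitespace tokenization (an assumption).
        return set(text.lower().split())

    target_words = set().union(*(tokenize(c) for c in target_refs))
    similar_words = set().union(*(tokenize(c) for c in similar_refs))

    # Words that describe the target image but none of its similar images.
    distinctive = target_words - similar_words
    if not distinctive:
        return 0.0

    # Share of those distinctive words that the generated caption uses.
    return len(distinctive & tokenize(generated)) / len(distinctive)


target_refs = ["a man in a red jacket rides a horse"]
similar_refs = ["a man rides a horse on a beach"]

generic_score = distinctive_word_rate(
    "a man rides a horse", target_refs, similar_refs
)
specific_score = distinctive_word_rate(
    "a man in a red jacket on a horse", target_refs, similar_refs
)
```

In this toy group, the generic caption reuses only words shared with the similar image and scores lowest, while the caption that mentions the red jacket scores highest under the assumed formula.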
Findings
Significant improvement in caption distinctiveness and accuracy.
State-of-the-art performance on benchmark datasets.
User study confirms the effectiveness of the new metric.
Abstract
Describing images using natural language is widely known as image captioning, which has made consistent progress due to the development of computer vision and natural language generation techniques. Though conventional captioning models achieve high accuracy based on popular metrics, i.e., BLEU, CIDEr, and SPICE, the ability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employ contrastive learning or re-weight the ground-truth captions, both of which focus on a single input image. However, the relationships between objects in a similar image group (e.g., items or properties within the same album or fine-grained events) are neglected. In this paper, we improve the distinctiveness of image captions using a Group-based Distinctive Captioning Model (GdisCap), which compares each image with other images…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
