No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Manu Gaur; Darshan Singh; Makarand Tapaswi

arXiv:2409.03025·cs.CV·April 10, 2025

TL;DR

This paper improves fine-grained image captioning by strengthening the MLE initialization of the captioner and adding a curriculum to self-retrieval (SR) fine-tuning, and it introduces a new evaluation benchmark. The aim is more detailed captions without the loss of faithfulness observed with earlier SR fine-tuning.

Contribution

The paper proposes Visual Caption Boosting and BagCurri to improve fine-grained captioning, and introduces TrueMatch, a new benchmark for evaluating subtle visual distinctions.

Findings

01

Outperforms previous SR fine-tuning methods by +8.9% on SR accuracy.

02

Achieves a +7.6% improvement on the ImageCoDe dataset.

03

Achieves state-of-the-art results on the TrueMatch benchmark with fewer parameters.

Abstract

Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training…
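To make the notion of self-retrieval concrete, the sketch below shows one generic way to score it: a caption earns a reward of 1 when its embedding ranks its own source image first among a bag of candidate images. This is an illustration under stated assumptions, not the paper's implementation. The function names, the use of cosine similarity, the binary reward, and the toy 2-D embeddings are all hypothetical; a real setup would take embeddings from a pretrained image-text model and may use a different reward.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def self_retrieval_reward(caption_emb, image_embs, target_idx):
    """Hypothetical binary SR reward: 1.0 if the caption retrieves its own image.

    caption_emb: embedding of the generated caption
    image_embs:  embeddings of the candidate images (target plus distractors)
    target_idx:  position of the caption's source image in image_embs
    """
    scores = [cosine(caption_emb, img) for img in image_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return 1.0 if best == target_idx else 0.0


def self_retrieval_accuracy(caption_embs, image_embs):
    """Fraction of captions that retrieve their own image (caption i pairs with image i)."""
    hits = sum(
        self_retrieval_reward(cap, image_embs, i)
        for i, cap in enumerate(caption_embs)
    )
    return hits / len(caption_embs)


# Toy data (assumed): three images and one caption embedding per image.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]  # the third caption is too generic

accuracy = self_retrieval_accuracy(captions, images)
```

In this toy bag the third caption sits closer to the first image than to its own, so it fails retrieval. A caption that omits the distinguishing detail of its image behaves the same way, which is why visually similar distractors make the task harder.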

Code & Models

Models

ariG23498/NDLB (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization