No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur, Darshan Singh, Makarand Tapaswi

TL;DR
This paper improves fine-grained image captioning by strengthening the MLE initialization of the captioner and adding a curriculum to self-retrieval (SR) fine-tuning, which yields more detailed captions without the loss of faithfulness seen in earlier SR approaches. It also introduces a new benchmark for evaluating fine-grained captions.
Contribution
It proposes Visual Caption Boosting and BagCurri to improve fine-grained captioning and introduces TrueMatch as a new benchmark for evaluating subtle visual distinctions.
Findings
Outperforms previous SR fine-tuning methods by +8.9% on SR accuracy.
Achieves a +7.6% improvement on the ImageCoDe dataset.
Achieves state-of-the-art results on the TrueMatch benchmark with fewer parameters.
Abstract
Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning tends to reduce caption faithfulness and can even cause hallucination. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training…
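To make the self-retrieval idea concrete, here is a minimal sketch of how SR accuracy over a bag of similar images could be computed. It is not the authors' implementation: the function names, the toy 2-D embeddings, and the use of cosine similarity are assumptions for illustration, and in practice the vectors would come from a pretrained text-image encoder. A caption counts as a hit when its own image is the most similar one in the bag.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def self_retrieval_accuracy(caption_embs, image_embs):
    """Fraction of captions whose most similar image in the bag is their own.

    Assumes caption_embs[i] embeds the caption generated for image i.
    """
    hits = 0
    for i, cap in enumerate(caption_embs):
        scores = [cosine(cap, img) for img in image_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(caption_embs)

# Toy bag of three images (hypothetical embeddings); images 0 and 1 are near-duplicates.
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# The caption for image 1 is too generic and lands closer to image 0.
captions = [[1.0, 0.0], [1.0, 0.02], [0.1, 1.0]]

accuracy = self_retrieval_accuracy(captions, images)
print(f"SR accuracy: {accuracy:.3f}")
```

In this toy bag the generic caption for image 1 retrieves image 0 instead, so only two of three captions succeed. An SR reward pushes the captioner to add the details that would separate such near-duplicates.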
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
