DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding
Hao Wu, Zhihang Zhong, Xiao Sun

TL;DR
DIR is a retrieval-augmented image captioning method that improves out-of-domain generalization through diffusion-guided retrieval enhancement and a semantically rich retrieval database, giving the captioner a more comprehensive understanding of the image.
Contribution
The paper introduces DIR, a novel approach that combines diffusion-guided retrieval with a semantically rich database to improve image captioning on data outside the training domain.
Findings
Significantly improves out-of-domain captioning performance
Maintains competitive in-domain results
Does not increase inference costs
Abstract
Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Diffusion
