DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

Hao Wu; Zhihang Zhong; Xiao Sun

arXiv:2412.01115 · cs.CV · December 3, 2024

Open Access

TL;DR

DIR is a retrieval-augmented image captioning method that improves out-of-domain generalization through diffusion-guided retrieval enhancement and a semantically rich text database, aiming for a more comprehensive understanding of each image.

Contribution

The paper introduces DIR (Dive Into Retrieval), an approach that combines diffusion-guided retrieval with a semantically rich database so that image captioning generalizes beyond the domain-specific data it was trained on.

Findings

01. Significantly improves out-of-domain captioning performance

02. Maintains competitive in-domain results

03. Does not increase inference costs

Abstract

Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding…
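For readers unfamiliar with the general setup the abstract refers to, the following is a minimal sketch of a generic image-to-text retrieval step followed by prompt construction for a captioner. It is not DIR's method: the diffusion-guided retrieval enhancement and the richer text database are not reproduced here. The function names, the toy three-dimensional embeddings, and the prompt wording are all hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_captions(image_embedding, text_db, k=2):
    """Return the k captions whose embeddings are most similar to the image embedding."""
    scored = [(cosine(image_embedding, emb), caption) for caption, emb in text_db]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]


def build_prompt(retrieved):
    """Format retrieved captions as context for a caption decoder (hypothetical template)."""
    context = "\n".join(f"- {caption}" for caption in retrieved)
    return f"Similar images show:\n{context}\nThis image shows:"


# Toy data: hand-written embeddings standing in for a real image/text encoder.
text_db = [
    ("a dog running on grass", [0.9, 0.1, 0.0]),
    ("a plate of pasta", [0.0, 0.2, 0.9]),
    ("a puppy playing in a park", [0.8, 0.3, 0.1]),
]
query_embedding = [1.0, 0.2, 0.0]

retrieved = retrieve_captions(query_embedding, text_db, k=2)
prompt = build_prompt(retrieved)
```

In this sketch the retrieved text is passed to the decoder as raw captions, which is the kind of usage the abstract describes as underutilizing the retrieved text.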

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Diffusion