Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation
Zhiyuan Li, Dongnan Liu, Heng Wang, Chaoyi Zhang, Weidong Cai

TL;DR
This paper introduces RaPSG, a retrieval-augmented pseudo sentence generation method that leverages large pre-trained models and retrieval techniques to improve annotation-free image captioning across multiple learning scenarios.
Contribution
It proposes a novel retrieval-augmented approach that distills knowledge from large pre-trained models and retrieves relevant descriptions to generate high-quality pseudo sentences for captioning.
Findings
Outperforms state-of-the-art models in zero-shot, unsupervised, semi-supervised, and cross-domain settings.
Effectively uses retrieval and filtering to enhance pseudo sentence quality.
Demonstrates the effectiveness of combining retrieval with large pre-trained models for annotation-free captioning.
Abstract
Recently, training an image captioner without annotated image-sentence pairs has gained traction. Previous methods have faced limitations due to either using mismatched corpora for inaccurate pseudo annotations or relying on resource-intensive pre-training. To alleviate these challenges, we propose a new strategy where the prior knowledge from large pre-trained models (LPMs) is distilled and leveraged as supervision, and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which can efficiently retrieve highly relevant short region descriptions from the mismatching corpora and use them to generate a variety of high-quality pseudo sentences via LPMs. Additionally, we introduce a fluency filter and a CLIP guidance objective to enhance contrastive information learning. Experimental…
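The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the image and each short region description have already been embedded into a shared vector space by some pre-trained encoder, and it ranks descriptions by cosine similarity. The abstract does not specify the encoder, the similarity measure, or the value of k, so the function names, toy vectors, and example descriptions below are all hypothetical and are not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(image_vec, corpus, k=2):
    """Return the k descriptions whose embeddings are most similar to image_vec.

    corpus is a list of (description, embedding) pairs. The ranking rule
    (cosine similarity) is an assumption made for this sketch.
    """
    scored = [(cosine(image_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 3-dimensional embeddings, invented purely for illustration.
image_vec = [0.9, 0.1, 0.0]
corpus = [
    ("a dog on grass", [1.0, 0.0, 0.0]),
    ("a red car", [0.0, 1.0, 0.0]),
    ("a brown dog running", [0.8, 0.2, 0.1]),
    ("a plate of food", [0.0, 0.0, 1.0]),
]

top = retrieve_top_k(image_vec, corpus, k=2)
print(top)
```

In a pipeline like the one the abstract outlines, the retrieved descriptions would then be passed to a large pre-trained model to compose pseudo sentences; that generation step, the fluency filter, and the CLIP guidance objective are not modeled here.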
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
