Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

Zhiyuan Li; Dongnan Liu; Heng Wang; Chaoyi Zhang; Weidong Cai

arXiv:2307.14750·cs.CV·October 15, 2024

TL;DR

This paper introduces RaPSG, a retrieval-augmented pseudo sentence generation method that leverages large pre-trained models and retrieval techniques to improve annotation-free image captioning across multiple learning scenarios.

Contribution

It proposes a novel retrieval-augmented approach that distills knowledge from large pre-trained models and retrieves relevant descriptions to generate high-quality pseudo sentences for captioning.
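The retrieval step can be illustrated with a minimal sketch. It assumes the image and the candidate region descriptions have already been embedded into a shared vector space by some pre-trained encoder; the toy vectors, the `cosine` helper, and the `retrieve_top_k` function are hypothetical names used for illustration and are not taken from the RaPSG code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_top_k(image_vec, corpus, k=2):
    """Return the k descriptions whose embeddings are closest to the image.

    corpus is a list of (description, embedding) pairs.
    """
    scored = [(cosine(image_vec, emb), text) for text, emb in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy embeddings (assumption: real embeddings would come from a pre-trained model).
corpus = [
    ("a dog on grass", [0.9, 0.1, 0.0]),
    ("a red car", [0.0, 1.0, 0.0]),
    ("green grass field", [0.7, 0.0, 0.3]),
]
image_vec = [1.0, 0.0, 0.1]

top = retrieve_top_k(image_vec, corpus, k=2)
print(top)
```

In a pipeline of the kind the paper describes, the retrieved short descriptions would then be passed to a large pre-trained model to compose pseudo sentences; that step is omitted here because the summary does not specify it.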

Findings

01. Outperforms state-of-the-art models in zero-shot, unsupervised, semi-supervised, and cross-domain settings.

02. Effectively uses retrieval and filtering to enhance pseudo sentence quality.

03. Demonstrates the effectiveness of combining retrieval with large pre-trained models for annotation-free captioning.

Abstract

Recently, training an image captioner without annotated image-sentence pairs has gained traction. Previous methods have faced limitations due to either using mismatched corpora for inaccurate pseudo annotations or relying on resource-intensive pre-training. To alleviate these challenges, we propose a new strategy where the prior knowledge from large pre-trained models (LPMs) is distilled and leveraged as supervision, and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which can efficiently retrieve highly relevant short region descriptions from the mismatching corpora and use them to generate a variety of high-quality pseudo sentences via LPMs. Additionally, we introduce a fluency filter and a CLIP guidance objective to enhance contrastive information learning. Experimental…

Code & Models

Repositories

zhiyuan-li-john/rapsg
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization