ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning

Taewhan Kim; Soeun Lee; Si-Woo Kim; Dong-Jin Kim

arXiv:2412.19289·cs.CV·January 27, 2025

TL;DR

ViPCap introduces a retrieval text-based visual prompt for lightweight image captioning. It combines retrieved text with image information and reports improved performance on COCO, Flickr30k, and NoCaps.

Contribution

The paper proposes ViPCap, a novel retrieval text-based visual prompt that combines retrieved text and image features to improve lightweight image captioning.

Findings

01 Outperforms prior models on the COCO, Flickr30k, and NoCaps datasets.

02 Enhances model efficiency and effectiveness.

03 Provides a plug-and-play solution for captioning tasks.

Abstract

Recent lightweight image captioning models that use retrieved data focus mainly on text prompts: the retrieved text serves only as a text prompt, while the visual information relies solely on the CLIP visual embedding. As a result, the image descriptions contained in the prompt are not sufficiently reflected in the visual embedding space. To address this, we propose ViPCap, a novel retrieval text-based visual prompt for lightweight image captioning. ViPCap uses the retrieved text together with image information as visual prompts, improving the model's ability to capture relevant visual information. By mapping text prompts into the CLIP space and generating multiple randomized Gaussian distributions, our method leverages sampling to explore randomly augmented distributions and effectively retrieves the semantic features that…
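The sampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_gaussian_prompts`, the per-dimension mean and log-variance inputs, and the toy vectors standing in for CLIP text embeddings are all assumptions made for illustration. The sketch only shows the general idea of treating each retrieved-text embedding as the mean of a Gaussian and drawing several randomized samples around it; how ViPCap parameterizes the distributions and uses the samples is not specified in the text above.

```python
import math
import random


def sample_gaussian_prompts(means, log_vars, num_samples, seed=0):
    """Draw num_samples Gaussian samples around each embedding.

    means, log_vars: lists of equal-length float lists, one per retrieved
    caption (hypothetical stand-ins for embeddings mapped into CLIP space).
    Returns a nested list indexed as [sample][caption][dimension].
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        draw = [
            # Reparameterized draw: mean + std * standard normal noise.
            [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
             for m, lv in zip(mean, log_var)]
            for mean, log_var in zip(means, log_vars)
        ]
        samples.append(draw)
    return samples


# Toy inputs: 2 retrieved captions with 3-dimensional embeddings (assumed values).
means = [[0.1, 0.2, 0.3], [-0.5, 0.0, 0.5]]
log_vars = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # log-variance 0 means unit variance
samples = sample_gaussian_prompts(means, log_vars, num_samples=4)
print(len(samples), len(samples[0]), len(samples[0][0]))
```

Each outer entry of `samples` is one randomized view of the full set of retrieved-text embeddings. A downstream module would then have to choose or aggregate among these views, a step this sketch leaves out.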

Code & Models

Repositories

taewhankim/vipcap (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training · Focus