ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning
Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim

TL;DR
ViPCap introduces a retrieval-based visual prompt method that enhances lightweight image captioning by integrating retrieved text and image information, improving performance on the COCO, Flickr30k, and NoCaps datasets.
Contribution
The paper proposes ViPCap, a novel retrieval text-based visual prompt that combines retrieved text and image features to improve lightweight image captioning.
Findings
Outperforms prior models on COCO, Flickr30k, and NoCaps datasets.
Enhances model efficiency and effectiveness.
Provides a plug-and-play solution for captioning tasks.
Abstract
Recent lightweight image captioning models using retrieved data mainly focus on text prompts. However, previous works use the retrieved text only as text prompts, and the visual information relies solely on the CLIP visual embedding. As a result, the image descriptions inherent in the prompt are not sufficiently reflected in the visual embedding space. To tackle this issue, we propose ViPCap, a novel retrieval text-based visual prompt for lightweight image captioning. ViPCap leverages the retrieved text with image information as visual prompts to enhance the ability of the model to capture relevant visual information. By mapping text prompts into the CLIP space and generating multiple randomized Gaussian distributions, our method leverages sampling to explore randomly augmented distributions and effectively retrieves the semantic features that…
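The abstract is cut off before it gives details of the sampling step, so the following is only an illustrative sketch of the general idea it names: drawing samples from several randomized Gaussian distributions centred on a text embedding. It is not the paper's implementation. The function name `sample_text_features`, its parameters, the jitter range for the standard deviation, and the 4-dimensional stand-in embedding are all assumptions made for this example; a real CLIP text embedding would come from a CLIP text encoder and have far more dimensions.

```python
import random


def sample_text_features(text_embedding, num_distributions=3,
                         samples_per_dist=2, base_std=0.1, seed=0):
    """Hypothetical helper (not from the paper): draw samples from several
    randomized Gaussians centred on a single text embedding."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_distributions):
        # Assumption: each distribution gets a randomly jittered spread.
        std = base_std * rng.uniform(0.5, 1.5)
        for _ in range(samples_per_dist):
            # Reparameterized draw: mean + std * eps, with eps ~ N(0, 1).
            samples.append(
                [m + std * rng.gauss(0.0, 1.0) for m in text_embedding]
            )
    return samples


# Stand-in for the CLIP-space embedding of one retrieved caption.
retrieved_embedding = [0.2, -0.1, 0.4, 0.0]
features = sample_text_features(retrieved_embedding)
print(len(features), len(features[0]))  # 6 samples, each of dimension 4
```

In a sketch like this, the sampled vectors would then be combined with image features to form the visual prompt; how ViPCap does that is not stated in the visible part of the abstract, so it is left out here.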
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training · Focus
