Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
Junjie Fei, Teng Wang, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng

TL;DR
ViECap introduces an entity-aware decoding approach for zero-shot image captioning, reducing object hallucination and improving cross-domain transferability by guiding language models to focus on actual visual entities.
Contribution
The paper proposes ViECap, a novel transferable decoding model that uses entity-aware hard prompts to enhance zero-shot image captioning across seen and unseen domains.
Findings
Sets new state-of-the-art in cross-domain captioning
Maintains performance when transferring from in-domain to out-of-domain scenarios
Performs competitively in in-domain captioning
Abstract
Image-to-text generation aims to describe images using natural language. Recently, zero-shot image captioning based on pre-trained vision-language models (VLMs) and large language models (LLMs) has made significant progress. However, we have observed and empirically demonstrated that these methods are susceptible to modality bias induced by LLMs and tend to generate descriptions containing objects (entities) that do not actually exist in the image but frequently appear during training (i.e., object hallucination). In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image, enabling coherent caption generation across diverse scenes. With entity-aware hard prompts, ViECap…
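As a rough illustration of what an entity-aware hard prompt might look like, the sketch below turns a set of entity scores into a short text prompt that could be fed to the language model alongside the image features. Everything here is an assumption for illustration: the function names `select_entities` and `build_hard_prompt`, the threshold and top-k values, the prompt template, and the example scores are hypothetical and are not taken from the paper's released code. In practice the scores would come from a vision-language model comparing the image against an entity vocabulary.

```python
from typing import Dict, List


def select_entities(scores: Dict[str, float],
                    threshold: float = 0.2,
                    top_k: int = 3) -> List[str]:
    """Keep at most top_k entity names, highest score first,
    dropping any whose score falls below the threshold.
    (Hypothetical selection rule, for illustration only.)"""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:top_k] if score >= threshold]


def build_hard_prompt(entities: List[str]) -> str:
    """Format the selected entities into a fixed text template.
    (Assumed template; the fallback wording is also an assumption.)"""
    if not entities:
        return "There is something in the image."
    return "There are " + ", ".join(entities) + " in the image."


# Made-up image-to-entity similarity scores.
scores = {"dog": 0.61, "frisbee": 0.33, "car": 0.04, "grass": 0.25}

entities = select_entities(scores)
prompt = build_hard_prompt(entities)
print(prompt)
```

Because the prompt names only entities that score well against the image, the language model is nudged toward objects that are actually present rather than ones that merely co-occurred often in its training text.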
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
