Transferable Decoding with Visual Entities for Zero-Shot Image Captioning

Junjie Fei; Teng Wang; Jinrui Zhang; Zhenyu He; Chengjie Wang; Feng Zheng

arXiv:2307.16525·cs.CV·August 1, 2023·2 cites

TL;DR

ViECap introduces an entity-aware decoding approach for zero-shot image captioning, reducing object hallucination and improving cross-domain transferability by guiding language models to focus on actual visual entities.

Contribution

The paper proposes ViECap, a novel transferable decoding model that uses entity-aware hard prompts to enhance zero-shot image captioning across seen and unseen domains.

Findings

01. Sets a new state of the art in cross-domain captioning

02. Maintains performance when transferring from in-domain to out-of-domain scenarios

03. Performs competitively on in-domain captioning

Abstract

Image-to-text generation aims to describe images using natural language. Recently, zero-shot image captioning based on pre-trained vision-language models (VLMs) and large language models (LLMs) has made significant progress. However, we have observed and empirically demonstrated that these methods are susceptible to modality bias induced by LLMs and tend to generate descriptions containing objects (entities) that do not actually exist in the image but frequently appear during training (i.e., object hallucination). In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image, enabling coherent caption generation across diverse scenes. With entity-aware hard prompts, ViECap…
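To make the idea of an entity-aware hard prompt concrete, here is a minimal Python sketch. It assumes that entity scores have already been obtained, for example from a VLM's image-text similarity against a fixed vocabulary. The function names, the threshold, the top-k cutoff, and the prompt wording are all hypothetical choices for illustration and are not taken from the abstract.

```python
from typing import Dict, List


def select_entities(scores: Dict[str, float],
                    threshold: float = 0.2,
                    top_k: int = 3) -> List[str]:
    """Keep at most top_k entity names whose score meets the threshold.

    `scores` is a hypothetical mapping from entity name to a confidence
    value; in practice it might come from a VLM such as CLIP.
    """
    kept = [(name, s) for name, s in scores.items() if s >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)  # highest score first
    return [name for name, _ in kept[:top_k]]


def build_hard_prompt(entities: List[str]) -> str:
    """Turn entity names into a plain-text prompt for the language model.

    The template wording is an assumption made for this sketch.
    """
    if not entities:
        return "There is something in the image."
    return "There are " + ", ".join(entities) + " in the image."


# Hypothetical scores for one image.
example_scores = {"dog": 0.61, "frisbee": 0.34, "cat": 0.05}
example_prompt = build_hard_prompt(select_entities(example_scores))
print(example_prompt)
```

Because the prompt is ordinary text listing only entities that passed the score threshold, the language model is steered toward objects with visual evidence rather than objects that merely co-occur often in its training captions.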

Code & Models

Repositories

feielysia/viecap
PyTorch · Official

Videos

Transferable Decoding with Visual Entities for Zero-Shot Image Captioning · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques