Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images
Zimao Lu, Hui Xu, Bing Liu, Ke Wang

TL;DR
This paper introduces Negative Entity Suppression (NES), a novel method for zero-shot image captioning that reduces hallucinations and improves cross-domain generalization by filtering and suppressing irrelevant entities in generated captions.
Contribution
The paper proposes NES, a three-stage approach that uses synthetic images and attention suppression to mitigate hallucinations in zero-shot captioning, advancing state-of-the-art performance.
Findings
Improves cross-domain transfer in zero-shot captioning.
Reduces hallucination rates significantly.
Maintains competitive in-domain performance.
Abstract
Text-only training provides an attractive approach to address data scarcity challenges in zero-shot image captioning (ZIC), avoiding the expense of collecting paired image-text annotations. However, although these approaches perform well within training domains, they suffer from poor cross-domain generalization, often producing hallucinated content when encountering novel visual environments. Retrieval-based methods attempt to mitigate this limitation by leveraging external knowledge, but they can paradoxically exacerbate hallucination when retrieved captions contain entities irrelevant to the inputs. We introduce the concept of negative entities--objects that appear in the generated caption but are absent from the input--and propose Negative Entity Suppression (NES) to tackle this challenge. NES seamlessly integrates three stages: (1) it employs synthetic images to ensure consistent…
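The definition of negative entities given in the abstract amounts to a set difference. The following is a minimal Python sketch of that definition only, assuming entity lists have already been extracted from the caption and from the input image. The function name `negative_entities`, the extraction step, and the example entities are hypothetical and are not taken from the paper; the sketch does not reproduce any of the three NES stages.

```python
def negative_entities(caption_entities, image_entities):
    """Return entities mentioned in a caption but absent from the image.

    Both arguments are iterables of entity strings (hypothetical inputs;
    how entities are extracted is outside the scope of this sketch).
    Comparison is case-insensitive and the result is sorted.
    """
    present = {e.lower() for e in image_entities}
    mentioned = {e.lower() for e in caption_entities}
    return sorted(mentioned - present)


# Hypothetical example: the caption mentions a frisbee that is not in the image.
caption = ["dog", "frisbee", "beach"]
image = ["dog", "beach"]
print(negative_entities(caption, image))  # prints ['frisbee']
```

Under this reading, an empty result means the caption mentions nothing beyond what the image contains.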
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
