MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng, Yan Xie, Hao Zhang, Chiyu Chen, Zhengjue Wang, Bo Chen

TL;DR
MeaCap is a novel memory-augmented framework for zero-shot image captioning that enhances caption relevance and reduces hallucinations by retrieving and filtering key concepts related to the image before generating captions.
Contribution
The paper introduces MeaCap, which incorporates a textual memory and a retrieve-then-filter module to improve zero-shot image captioning performance and caption quality.
Findings
Achieves state-of-the-art results on zero-shot IC benchmarks.
Generates more accurate, concept-centered captions with fewer hallucinations.
Effectively integrates memory-augmented retrieval with language modeling.
Abstract
Zero-shot image captioning (IC) without well-paired image-text data can be divided into two categories: training-free and text-only-training. Generally, both types of methods realize zero-shot IC by integrating a pre-trained vision-language model such as CLIP for image-text similarity evaluation and a pre-trained language model (LM) for caption generation. The main difference between them is whether a textual corpus is used to train the LM. Though they achieve attractive performance w.r.t. some metrics, existing methods often exhibit common drawbacks: training-free methods tend to produce hallucinations, while text-only-training methods often lose generalization capability. To move forward, in this paper, we propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap). Specifically, equipped with a textual memory, we introduce a retrieve-then-filter module to get key concepts…
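The retrieve-then-filter idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional embeddings stand in for CLIP features, and the function names (`retrieve`, `filter_concepts`), the stopword list, and the word-frequency filtering rule are all assumptions made for illustration. The paper's actual concept extraction and filtering may differ.

```python
import math
from collections import Counter

# Hypothetical stopword list, used only for this illustration.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "is", "at"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(image_emb, memory, k):
    """Return the k memory captions most similar to the image embedding."""
    ranked = sorted(memory, key=lambda item: cosine(image_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]


def filter_concepts(captions, min_support):
    """Stand-in filter (an assumption): keep words that appear in at least
    min_support of the retrieved captions, ignoring stopwords."""
    counts = Counter()
    for caption in captions:
        counts.update(set(caption.lower().split()) - STOPWORDS)
    return sorted(word for word, n in counts.items() if n >= min_support)


# Toy textual memory: (caption, hand-made embedding). Real systems would
# embed captions and the image with a vision-language model such as CLIP.
memory = [
    ("a dog runs on the beach", [0.9, 0.1, 0.0]),
    ("a dog plays with a ball", [0.8, 0.2, 0.1]),
    ("a plate of pasta on a table", [0.0, 0.1, 0.9]),
    ("a dog sleeps on the beach", [0.85, 0.15, 0.05]),
]
image_emb = [1.0, 0.1, 0.0]

retrieved = retrieve(image_emb, memory, k=3)
concepts = filter_concepts(retrieved, min_support=2)
print(concepts)
```

In this toy setup the pasta caption is never retrieved, and words supported by only one retrieved caption are dropped, so only the concepts shared across the retrieved captions would be passed on to the language model for caption generation.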
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training
