MeaCap: Memory-Augmented Zero-shot Image Captioning

Zequn Zeng; Yan Xie; Hao Zhang; Chiyu Chen; Zhengjue Wang; Bo Chen

arXiv:2403.03715 · cs.CV · March 7, 2024 · 2 cites

TL;DR

MeaCap is a novel memory-augmented framework for zero-shot image captioning that enhances caption relevance and reduces hallucinations by retrieving and filtering key concepts related to the image before generating captions.

Contribution

The paper introduces MeaCap, which incorporates a textual memory and a retrieve-then-filter module to improve zero-shot image captioning performance and caption quality.

Findings

01. Achieves state-of-the-art results on zero-shot IC benchmarks.

02. Generates more accurate, concept-centered captions with fewer hallucinations.

03. Effectively integrates memory-augmented retrieval with language modeling.

Abstract

Zero-shot image captioning (IC) without well-paired image-text data can be divided into two categories: training-free and text-only-training. Generally, both types of methods realize zero-shot IC by integrating a pre-trained vision-language model such as CLIP for image-text similarity evaluation with a pre-trained language model (LM) for caption generation. The main difference between them is whether a textual corpus is used to train the LM. Though they achieve attractive performance on some metrics, existing methods often exhibit common drawbacks: training-free methods tend to produce hallucinations, while text-only-training methods often lose generalization capability. To move forward, in this paper we propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap). Specifically, equipped with a textual memory, we introduce a retrieve-then-filter module to get key concepts…
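To make the retrieve-then-filter idea concrete, here is a minimal sketch and not the paper's implementation. It assumes the image and the memory captions have already been embedded into a shared space (for example by a CLIP-style encoder) and uses toy 3-dimensional vectors instead. The abstract excerpt ends before the filter step is described, so the frequency-based filter is purely an illustrative assumption. The function names, the stopword list, and the parameters top_k and min_count are hypothetical.

```python
from collections import Counter

import numpy as np

# Hypothetical stopword list, used only for this illustration.
STOPWORDS = frozenset({"a", "the", "on", "in", "of", "is", "and"})


def cosine_sim(query, matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q


def retrieve_then_filter(image_emb, memory_embs, memory_texts, top_k=2, min_count=2):
    """Retrieve the top_k memory captions most similar to the image, then keep
    words that occur in at least min_count of them.

    The filter rule is an assumption for illustration, not the paper's method.
    """
    sims = cosine_sim(image_emb, memory_embs)
    top_idx = np.argsort(-sims)[:top_k]  # indices sorted by descending similarity
    retrieved = [memory_texts[i] for i in top_idx]

    counts = Counter()
    for text in retrieved:
        # Count each word once per caption, ignoring stopwords.
        counts.update(set(text.lower().split()) - STOPWORDS)

    concepts = sorted(w for w, c in counts.items() if c >= min_count)
    return retrieved, concepts


# Toy textual memory with made-up embeddings standing in for real encoder outputs.
memory_texts = [
    "a dog runs on the grass",
    "a dog catches a frisbee",
    "a plate of pasta",
]
memory_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([1.0, 0.15, 0.0])

retrieved, concepts = retrieve_then_filter(image_emb, memory_embs, memory_texts)
print(retrieved)
print(concepts)
```

In a real pipeline the selected concepts would then condition the language model that writes the caption. That generation step is outside the scope of this sketch.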

Code & Models

Repositories

joeyz0z/meacap
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training