Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

Kanzhi Cheng; Wenpo Song; Zheng Ma; Wenhao Zhu; Zixuan Zhu; Jianbing Zhang

arXiv:2308.01126 · cs.CV · August 3, 2023

TL;DR

This paper introduces K-Replay, a method that lets a Vision-Language Pre-Training (VLP) model retain its real-world knowledge during captioning fine-tuning, so that generated captions name entities and context more accurately without losing description quality.

Contribution

It proposes K-Replay, a novel approach combining knowledge prediction and distillation to retain and utilize pre-training knowledge during fine-tuning for better image captioning.
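The combination of a knowledge prediction term and a distillation term can be pictured with a small sketch. This is not the authors' implementation: the function names, the loss weights `w_pred` and `w_kd`, and the specific forms chosen here (negative log of the probability mass on keyword tokens, and a KL divergence against a frozen pre-trained model) are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def knowledge_prediction_loss(student_logits, keyword_ids):
    """Assumed form: negative log of the probability mass that the
    fine-tuned model assigns to the knowledge keyword tokens."""
    probs = softmax(student_logits)
    mass = sum(probs[i] for i in keyword_ids)
    return -math.log(max(mass, 1e-12))  # guard against log(0)

def distillation_loss(student_logits, teacher_logits):
    """Assumed form: KL(teacher || student), where the teacher is the
    frozen pre-trained model and the student is being fine-tuned."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def replay_objective(caption_ce, student_logits, teacher_logits,
                     keyword_ids, w_pred=1.0, w_kd=1.0):
    """Hypothetical total loss: the usual captioning cross-entropy plus
    the two knowledge-retention terms, with illustrative weights."""
    return (caption_ce
            + w_pred * knowledge_prediction_loss(student_logits, keyword_ids)
            + w_kd * distillation_loss(student_logits, teacher_logits))
```

In this sketch the distillation term is zero when the fine-tuned model's distribution matches the pre-trained one, and the prediction term shrinks as more probability is placed on the keyword tokens, so both push against drifting toward generic captions.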

Findings

01. Outperforms baseline by 20.9 CIDEr points
02. Achieves 54.5% knowledge recognition accuracy
03. Creates a new benchmark dataset KnowCap
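To make the knowledge recognition accuracy figure above more concrete, here is a minimal sketch of how such a score could be computed. It assumes the metric is the fraction of generated captions that mention the reference keyword for their image; the exact definition used in the paper is not given on this page, and the function name and the example captions below are made up for illustration.

```python
def recognition_accuracy(captions, keywords):
    """Fraction of captions that contain their reference keyword
    (case-insensitive substring match; an assumed definition)."""
    if len(captions) != len(keywords):
        raise ValueError("captions and keywords must have the same length")
    if not captions:
        return 0.0
    hits = sum(1 for cap, kw in zip(captions, keywords)
               if kw.lower() in cap.lower())
    return hits / len(captions)

# Invented example: the first caption names the entity, the second stays generic.
example_captions = [
    "a view of the Eiffel Tower at night",
    "a tall tower in a city",
]
example_keywords = ["Eiffel Tower", "Eiffel Tower"]
score = recognition_accuracy(example_captions, example_keywords)
```

With these two invented captions, one of two mentions the keyword, so `score` is 0.5.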

Abstract

Current captioning approaches tend to generate correct but "generic" descriptions that lack real-world knowledge, e.g., named entities and contextual information. Considering that Vision-Language Pre-Training (VLP) models master massive such knowledge from large-scale web-harvested data, it is promising to utilize the generalizability of VLP models to incorporate knowledge into image descriptions. However, using VLP models faces challenges: zero-shot inference suffers from knowledge hallucination that leads to low-quality descriptions, but the generic bias in downstream task fine-tuning hinders the VLP model from expressing knowledge. To address these concerns, we propose a simple yet effective method called Knowledge-guided Replay (K-Replay), which enables the retention of pre-training knowledge during fine-tuning. Our approach consists of two parts: (1) a knowledge prediction task on…

Code & Models

Repositories

njucckevin/knowcap (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Knowledge Distillation