Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model

Reem AlJunaid; Muzammil Behzad

arXiv:2505.23358 · cs.CV · May 30, 2025

TL;DR

This paper introduces KRCapVLM, a vision-language model that enhances knowledge-rich image captioning by using beam-guided decoding, attention modules, and training schedulers to produce more informative, diverse, and contextually relevant captions.

Contribution

The paper presents a novel framework combining knowledge replay, beam search, attention modules, and training schedulers for improved knowledge-rich image captioning.

Findings

01. Significant improvements in caption quality and knowledge recognition accuracy.

02. Enhanced ability to generalize to unseen knowledge concepts.

03. More informative and contextually relevant image descriptions.

Abstract

Generating informative, knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions that lack specificity and contextual depth. To address this limitation, we propose KRCapVLM, a novel knowledge replay-based image captioning framework built on a vision-language model. We incorporate beam search decoding to generate more diverse and coherent captions. We also integrate attention-based modules into the image encoder to enhance feature representation. Finally, we employ training schedulers to improve stability and ensure smoother convergence during training. Together, these additions yield substantial gains: the proposed model demonstrates clear improvements in both the accuracy of knowledge recognition and the overall quality of generated captions. It shows a stronger ability to…
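The abstract names beam search decoding without giving details on this page, so the following is only a minimal, generic sketch of the technique and not the authors' implementation. The function `beam_search`, the toy scorer `toy_model`, its vocabulary, and its probabilities are all hypothetical and chosen purely for illustration. The sketch keeps the `beam_width` highest-scoring partial captions at each step, ranked by summed log-probability, and applies no length normalization.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[str, float]],
    beam_width: int = 3,
    max_len: int = 10,
    eos: str = "<eos>",
) -> List[Tuple[Sequence, float]]:
    """Return hypotheses sorted by total log-probability, best first."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((seq + (token,), score + log_p))
        # Keep only the top `beam_width` candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


# Hypothetical toy "captioner": a fixed table of next-token probabilities.
_TABLE = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"building": 0.5, "tower": 0.5},
    ("the",): {"eiffel": 0.9, "tower": 0.1},
    ("the", "eiffel"): {"tower": 1.0},
}


def toy_model(seq: Sequence) -> Dict[str, float]:
    """Map a prefix to log-probabilities; unknown prefixes end the caption."""
    probs = _TABLE.get(seq, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


if __name__ == "__main__":
    for width in (1, 2):
        seq, score = beam_search(toy_model, beam_width=width)[0]
        print(width, " ".join(seq), round(math.exp(score), 2))
```

In this toy setup, a beam width of 1 (greedy decoding) commits to the locally most likely first token and ends on a generic caption, while a beam width of 2 keeps the less likely first token alive long enough to recover the more specific, higher-probability caption. A real captioning model would replace `toy_model` with the decoder's next-token distribution conditioned on image features.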
