AGIC: Attention-Guided Image Captioning to Improve Caption Relevance

L. D. M. S. Sai Teja; Ashok Urlana; Pruthwik Mishra

arXiv:2508.06853 · cs.CV · August 12, 2025

TL;DR

AGIC amplifies salient visual regions in the feature space and pairs this with a hybrid decoding strategy to improve caption relevance. On Flickr8k and Flickr30k it matches or surpasses several state-of-the-art models while running inference faster.

Contribution

The paper presents an attention-guided method that amplifies salient regions directly in the feature space, together with a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity.

Findings

01 AGIC matches or surpasses several state-of-the-art models on Flickr8k and Flickr30k.

02 AGIC achieves faster inference than the models it is compared against.

03 AGIC performs strongly across multiple evaluation metrics.

Abstract

Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.
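The abstract names two mechanisms without giving formulas, so the following is only a minimal sketch of how they might look, not the authors' implementation. It assumes that "amplifying salient regions" means scaling each patch feature by a factor that grows with its normalized attention weight, and that "hybrid decoding" means a deterministic top-k filter followed by probabilistic sampling among the survivors. The function names and the parameters `alpha`, `top_k`, and `temperature` are hypothetical.

```python
import math
import random


def amplify_salient_features(features, attention, alpha=1.0):
    """Scale each patch feature by 1 + alpha * (normalized attention).

    ASSUMPTION: this scaling rule is illustrative; the abstract only says
    salient regions are amplified directly in the feature space.
    features:  list of per-patch feature vectors (lists of floats)
    attention: one non-negative weight per patch
    """
    total = sum(attention)
    if total <= 0:
        # No usable attention signal: return an unmodified copy.
        return [list(row) for row in features]
    amplified = []
    for row, weight in zip(features, attention):
        scale = 1.0 + alpha * (weight / total)
        amplified.append([scale * x for x in row])
    return amplified


def hybrid_next_token(logits, rng, top_k=3, temperature=1.0):
    """Pick the next token id with a deterministic filter plus sampling.

    ASSUMPTION: top-k filtering followed by temperature sampling is one
    possible reading of "combines deterministic and probabilistic sampling".
    """
    # Deterministic step: keep only the top_k highest-scoring token ids.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked[:top_k]
    # Probabilistic step: sample among the candidates with softmax weights.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    patches = [[1.0, 1.0] for _ in range(4)]
    boosted = amplify_salient_features(patches, [0, 0, 1, 3], alpha=2.0)
    print(boosted)
    print(hybrid_next_token([0.1, 2.0, -1.0], random.Random(0), top_k=2))
```

In this sketch, setting `top_k=1` reduces the decoder to greedy selection, while larger `top_k` or `temperature` values trade fluency for diversity. Patches with zero attention pass through unchanged.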

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis