Beyond RNNs: Benchmarking Attention-Based Image Captioning Models

Hemanth Teja Yanambakkam; Rahul Chinthala

arXiv:2502.18734 · cs.CV · February 27, 2025

TL;DR

This paper benchmarks attention-based image captioning models against RNN-based approaches, demonstrating that attention mechanisms significantly improve caption quality and alignment with human judgments on the MS-COCO dataset.

Contribution

The paper compares attention-based and RNN-based models for image captioning on MS-COCO, highlighting the effectiveness of attention mechanisms in improving caption accuracy and semantic richness.

Findings

01

Attention models outperform RNNs in caption quality.

02

Attention improves alignment with human evaluations.

03

Attention models achieve better scores on BLEU, METEOR, GLEU, and WER (WER is an error rate, so lower is better).
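
Of the four metrics, WER (word error rate) is the only one where a smaller value indicates a better caption: it counts the word-level substitutions, deletions, and insertions needed to turn the generated caption into the reference, divided by the reference length. The sketch below is a minimal illustration of that definition, assuming whitespace tokenization and a single reference caption. The function name `word_error_rate` and the example captions are hypothetical and not taken from the paper, which may compute WER with a different tokenizer or library.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical captions, for illustration only.
reference = "a dog runs on the grass"
exact = word_error_rate(reference, "a dog runs on the grass")
noisy = word_error_rate(reference, "a cat runs on grass")
print(exact, noisy)
```

An exact match yields a WER of zero, while the second caption needs one substitution and one deletion against a six-word reference. Because insertions are counted, WER can exceed 1 when the generated caption is much longer than the reference.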

Abstract

Image captioning is a challenging task at the intersection of computer vision and natural language processing, requiring models to generate meaningful textual descriptions of images. Traditional approaches rely on recurrent neural networks (RNNs), but recent advancements in attention mechanisms have demonstrated significant improvements. This study benchmarks the performance of attention-based image captioning models against RNN-based approaches using the MS-COCO dataset. We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions. The models are assessed using natural language processing metrics such as BLEU, METEOR, GLEU, and WER. Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions, with better alignment to human evaluation. This work provides insights…
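
For readers unfamiliar with Bahdanau (additive) attention, the sketch below shows the general mechanism in plain Python: each image-region feature is scored against the decoder's hidden state, the scores are normalized with a softmax, and the context vector is the weighted sum of the region features. This is a sketch under stated assumptions, not the authors' implementation. The tiny dimensions, the random weights, and the names `additive_attention`, `w_feat`, `w_hid`, and `v` are chosen for illustration; the paper's architecture and hyperparameters are not given in this abstract.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(features, hidden, w_feat, w_hid, v):
    """Additive attention: score_i = v . tanh(W_feat f_i + W_hid h).

    features: one feature vector per image region
    hidden:   current decoder hidden state
    w_feat, w_hid: projections into a shared attention space
    v: vector reducing each projected region to a scalar score
    """
    projected_hidden = matvec(w_hid, hidden)
    scores = []
    for feat in features:
        projected_feat = matvec(w_feat, feat)
        energy = [math.tanh(a + b) for a, b in zip(projected_feat, projected_hidden)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))

    weights = softmax(scores)  # one weight per region, summing to 1
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
    return context, weights


# Illustrative sizes only: 4 regions, 3-dim features, 5-dim hidden state,
# 2-dim attention space. Real captioning models use far larger values.
rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


features = random_matrix(4, 3)
hidden = [rng.uniform(-1, 1) for _ in range(5)]
w_feat = random_matrix(2, 3)
w_hid = random_matrix(2, 5)
v = [rng.uniform(-1, 1) for _ in range(2)]

context, weights = additive_attention(features, hidden, w_feat, w_hid, v)
print("attention weights:", weights)
print("context vector:", context)
```

In a captioning decoder, a computation of this kind would be repeated at every generated word, so the weights can shift toward different image regions as the caption progresses. In a trained model the projection matrices and `v` are learned rather than sampled at random.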

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling