Beyond RNNs: Benchmarking Attention-Based Image Captioning Models
Hemanth Teja Yanambakkam, Rahul Chinthala

TL;DR
This paper benchmarks attention-based image captioning models against RNN-based approaches, demonstrating that attention mechanisms significantly improve caption quality and alignment with human judgments on the MS-COCO dataset.
Contribution
It provides a comprehensive comparison of attention-based and RNN-based models for image captioning, highlighting the effectiveness of attention mechanisms in improving caption accuracy and semantic richness.
Findings
Attention models outperform RNNs in caption quality.
Attention improves alignment with human evaluations.
Attention models score better on BLEU, METEOR, GLEU, and WER; WER is an error rate, so better means lower.
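WER is the one metric in this list where a smaller value is better, which is easy to miss when it sits next to BLEU, METEOR, and GLEU. The sketch below shows the standard definition: word-level edit distance divided by the number of reference words. It is an illustration only. The function name `word_error_rate` and the whitespace tokenization are assumptions, not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up captions for illustration: one substitution plus one deletion.
wer = word_error_rate("a dog runs on the beach", "a cat runs on beach")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.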
Abstract
Image captioning is a challenging task at the intersection of computer vision and natural language processing, requiring models to generate meaningful textual descriptions of images. Traditional approaches rely on recurrent neural networks (RNNs), but recent advancements in attention mechanisms have demonstrated significant improvements. This study benchmarks the performance of attention-based image captioning models against RNN-based approaches using the MS-COCO dataset. We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions. The models are assessed using natural language processing metrics such as BLEU, METEOR, GLEU, and WER. Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions, with better alignment to human evaluation. This work provides insights…
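The abstract names Bahdanau (additive) attention but this page does not describe the architecture. As a rough illustration, a single decoding step of additive attention over image region features could be sketched as follows. All names, shapes, and the random weights are assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def bahdanau_attention(features, hidden, W1, W2, v):
    """One step of additive attention over image regions.

    features: (num_regions, feat_dim) image region features
    hidden:   (hidden_dim,) current decoder state
    W1: (feat_dim, attn_dim), W2: (hidden_dim, attn_dim), v: (attn_dim,)
    """
    # score_i = v^T tanh(W1^T f_i + W2^T h); hidden @ W2 broadcasts over regions
    scores = np.tanh(features @ W1 + hidden @ W2) @ v  # (num_regions,)
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over regions
    context = weights @ features                       # (feat_dim,) weighted sum
    return context, weights


# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))  # 5 regions, 8-dim features
hidden = rng.normal(size=(4,))      # 4-dim decoder state
W1 = rng.normal(size=(8, 6))
W2 = rng.normal(size=(4, 6))
v = rng.normal(size=(6,))
context, weights = bahdanau_attention(features, hidden, W1, W2, v)
```

The attention weights form a distribution over image regions, so each generated word can draw on a different part of the image. In a trained model `W1`, `W2`, and `v` would be learned rather than sampled.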
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
