Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation
Israa A. Albadarneh, Bassam H. Hammo, Omar S. Al-Kadi

TL;DR
This paper provides a comprehensive review and evaluation of attention-based transformer models for multilingual image captioning, highlighting current challenges and future research directions in the field.
Contribution
It offers an in-depth survey categorizing models, analyzing datasets and metrics, and discussing limitations and future directions for attention-based multilingual image captioning.
Findings
Transformer models improve caption quality with attention mechanisms.
Multilingual captioning faces challenges like data scarcity and semantic inconsistencies.
Future directions include multimodal learning and real-time applications.
Abstract
Image captioning involves generating textual descriptions from input images, bridging the gap between computer vision and natural language processing. Recent advancements in transformer-based models have significantly improved caption generation by leveraging attention mechanisms for better scene understanding. While various surveys have explored deep learning-based approaches for image captioning, few have comprehensively analyzed attention-based transformer models across multiple languages. This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. Additionally, this paper identifies key limitations in current models, including semantic…
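Of the metrics the abstract names, BLEU is built on clipped (modified) n-gram precision. The following is a minimal sketch of that one component, not the survey's evaluation code: the function names are hypothetical, tokenization is assumed to be plain whitespace splitting, and the brevity penalty and the geometric mean over n-gram orders are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate caption against references.

    Each candidate n-gram count is clipped to the largest count of that
    n-gram in any single reference, so repeating a word cannot inflate
    the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0

    # For every n-gram, keep the maximum count seen in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
unigram_precision = modified_precision(candidate, references, 1)
bigram_precision = modified_precision(candidate, references, 2)
```

In this toy case the candidate repeats "the" seven times, but no reference contains it more than twice, so only two of the seven unigrams are credited.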
Taxonomy
Methods: Softmax · Attention Is All You Need
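The two methods tagged here meet in scaled dot-product attention, the core operation of "Attention Is All You Need": softmax(QK^T / sqrt(d_k)) V. The sketch below illustrates it in pure Python under simplifying assumptions (a single head, no masking, no learned projections); the function names are hypothetical and it is not taken from any model in the survey.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
# A zero query scores every key equally, so the output averages the values.
uniform = scaled_dot_product_attention([[0.0, 0.0]], keys, values)
# A query aligned with the first key shifts weight toward the first value.
focused = scaled_dot_product_attention([[10.0, 0.0]], keys, values)
```

In a captioning model the queries would typically come from the decoder's text states and the keys and values from image region features; here they are small hand-written vectors so the behavior is easy to inspect.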
