An Ensemble Model with Attention Based Mechanism for Image Captioning
Israa Al Badarneh, Bassam Hammo, Omar Al-Kadi

TL;DR
This paper introduces an ensemble transformer-based model with attention mechanisms for image captioning, achieving state-of-the-art results on Flickr8K and Flickr30K datasets by combining multiple neural networks and selecting captions based on BLEU scores.
Contribution
It proposes a novel ensemble learning framework that enhances image captioning quality by integrating multiple deep neural networks with a voting mechanism based on BLEU scores.
Findings
Achieved highest BLEU scores on Flickr8K and Flickr30K datasets.
Outperformed recent methods in image captioning benchmarks.
Demonstrated the effectiveness of ensemble learning in improving caption quality.
Abstract
Image captioning creates informative text from an input image by creating a relationship between the words and the actual content of an image. Recently, deep learning models that utilize transformers have been the most successful in automatically generating image captions. The capabilities of transformer networks have led to notable progress in several activities related to vision. In this paper, we thoroughly examine transformer models, emphasizing the critical role that attention mechanisms play. The proposed model uses a transformer encoder-decoder architecture to create textual captions and a deep learning convolutional neural network to extract features from the images. To create the captions, we present a novel ensemble learning framework that improves the richness of the generated captions by utilizing several deep neural network architectures based on a voting mechanism that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsSoftmax · Attention Is All You Need
