Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network
Md Aminul Haque Palash, MD Abdullah Al Nasim, Sourav Saha, Faria Afrin, Raisa Mallik, Sathishkumar Samiappan

TL;DR
This paper introduces a novel transformer-based encoder-decoder model with a ResNet-101 image encoder for Bangla image captioning, achieving state-of-the-art results on the BanglaLekhaImageCaptions dataset.
Contribution
It presents a new architecture combining CNN and transformer models for improved Bangla image captioning performance.
Findings
Outperforms existing Bengali captioning models
Achieves new benchmark scores on BLEU and METEOR metrics
Captures fine-grained image details in captions
Abstract
Automatic image captioning is the ongoing effort to generate syntactically valid and accurate natural-language descriptions of an image in context. The encoder-decoder structures used throughout existing Bengali Image Captioning (BIC) research take abstract image feature vectors as the encoder's input. We propose a novel transformer-based architecture with an attention mechanism that uses a pre-trained ResNet-101 model as the image encoder for feature extraction. Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, when paired with image features, produces accurate and diverse captions on the BanglaLekhaImageCaptions dataset. Our approach outperforms all existing Bengali Image Captioning work and sets a new benchmark by scoring 0.694 on BLEU-1, 0.630 on BLEU-2, 0.582 on BLEU-3, and 0.337…
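For readers unfamiliar with the BLEU-n scores quoted in the abstract, here is a minimal sketch of cumulative BLEU for a single caption, using clipped n-gram precision and a brevity penalty. It assumes whitespace tokenization and no smoothing. The paper's actual evaluation tooling and tokenizer are not stated in this summary, so the function names and the two Bangla sentences are purely illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, references, max_n):
    """Cumulative BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    c = len(candidate)
    # Reference length closest to the candidate length (ties pick the shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical example sentences, not taken from the dataset.
candidate = "একটি ছেলে মাঠে খেলছে".split()
reference = "একটি ছেলে মাঠে দৌড়াচ্ছে".split()
bleu_1 = bleu(candidate, [reference], 1)  # 3 of 4 unigrams match
bleu_2 = bleu(candidate, [reference], 2)  # 2 of 3 bigrams also match
```

Corpus-level scores such as those in the abstract are normally computed by pooling n-gram counts over all test captions rather than averaging per-sentence values, so this sketch illustrates the metric's mechanics only.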
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
