Using Deep Learning to Generate Semantically Correct Hindi Captions
Wasim Akram Khan, Anil Kumar Vuppala

TL;DR
This paper presents a deep learning approach for generating semantically accurate Hindi image captions by combining multi-modal architectures, attention mechanisms, and pre-trained CNNs, evaluated with BLEU scores on the Flickr8k dataset.
Contribution
It introduces a novel multi-modal architecture with attention and pre-trained CNNs for Hindi image captioning, filling a gap in non-English captioning research.
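To make the role of attention concrete, here is a minimal sketch of soft attention over image region features, written in plain Python. It is an illustration only: the summary does not say which attention variant the paper uses, so the dot-product scoring, the function names (`softmax`, `attend`), and the toy vectors are all assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, decoder_state):
    """Soft attention: weight each image region by its relevance to the
    current decoder state, then return the weighted average of regions.

    Assumption: relevance is a plain dot product; the paper may use a
    different (e.g. additive) scoring function.
    """
    scores = [
        sum(f * h for f, h in zip(region, decoder_state))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical toy input: three 2-dimensional region features and a decoder state.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(regions, state)
```

In a captioning decoder, the context vector would be recomputed at every time step so that each generated word can draw on different parts of the image.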
Findings
An attention-based bidirectional LSTM with VGG16 achieved a BLEU-1 of 0.59 and a BLEU-4 of 0.19.
The proposed model produces relevant, semantically accurate Hindi captions.
Experiments establish baseline results for Hindi image captioning.
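For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of sentence-level BLEU (clipped n-gram precision with a brevity penalty, uniform weights, no smoothing). It is not the paper's evaluation code; the whitespace tokenization, the function names, and the Hindi example sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n):
    """BLEU-N with uniform weights and no smoothing (0 if any precision is 0)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Hypothetical captions, tokenized on whitespace; they differ in one word.
reference = "एक कुत्ता घास में दौड़ रहा है".split()
candidate = "एक कुत्ता घास पर दौड़ रहा है".split()

bleu1 = bleu(candidate, [reference], 1)  # 6 of 7 unigrams match
bleu4 = bleu(candidate, [reference], 4)  # no 4-gram matches, so this is 0.0
```

The example shows why BLEU-4 is typically much lower than BLEU-1: a single differing word in the middle of a short caption breaks every 4-gram that spans it, while leaving most unigrams intact.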
Abstract
Automated image captioning from image content is appealing because it harnesses the capabilities of both computer vision and natural language processing. Extensive research has been done in the field, with a major focus on the English language, which leaves scope for further development in other widely spoken languages. This research uses distinct models to translate image captions into Hindi, the fourth most popular language in the world. Exploring multi-modal architectures, this research combines local visual features, global visual features, attention mechanisms, and pre-trained models. Hindi image descriptions were generated by running Google Cloud Translator on the Flickr8k image dataset. Pre-trained CNNs such as VGG16, ResNet50, and Inception V3 helped in retrieving image characteristics, while the uni-directional…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques
