Using Deep Learning to Generate Semantically Correct Hindi Captions

Wasim Akram Khan; Anil Kumar Vuppala

arXiv:2602.13352·cs.CV·February 17, 2026

TL;DR

This paper presents a deep learning approach for generating semantically accurate Hindi image captions. It combines multi-modal architectures, attention mechanisms, and pre-trained CNNs, and is evaluated with BLEU scores on the Flickr8k dataset.

Contribution

It introduces a novel multi-modal architecture with attention and pre-trained CNNs for Hindi image captioning, filling a gap in non-English captioning research.

Findings

01. The attention-based bidirectional LSTM with VGG16 achieved a BLEU-1 of 0.59 and a BLEU-4 of 0.19.

02. The proposed model produces relevant, semantically accurate Hindi captions.

03. The experiments establish baseline results for Hindi image captioning.
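To make the reported BLEU-1 and BLEU-4 figures easier to interpret, the following is a minimal sketch of unsmoothed sentence-level BLEU: clipped n-gram precision combined with a brevity penalty. It is an illustration only. The whitespace tokenization, the example Hindi sentences, and the function names (`ngrams`, `modified_precision`, `bleu`) are assumptions made here, and the paper's actual evaluation tooling and settings are not stated on this page.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    # Brevity penalty, using the reference length closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical example sentences (not taken from the paper's data).
candidate = "एक कुत्ता घास पर दौड़ रहा है".split()
reference = "एक भूरा कुत्ता घास पर दौड़ रहा है".split()

print("BLEU-1:", round(bleu(candidate, [reference], 1), 4))
print("BLEU-4:", round(bleu(candidate, [reference], 4), 4))
```

BLEU-1 only rewards matching individual words, while BLEU-4 also requires matching two-, three-, and four-word sequences, which is why BLEU-4 scores are typically much lower than BLEU-1 for the same captions. Corpus-level BLEU, as usually reported in papers, aggregates n-gram counts over all captions rather than averaging per-sentence scores.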

Abstract

Automated image captioning, which describes an image from its content, is an appealing application of computer vision combined with natural language processing. Extensive research has been done in the field, with a major focus on English, which leaves scope for further work on other widely spoken languages. This research uses distinct models to caption images in Hindi, the fourth most popular language in the world. Exploring multi-modal architectures, it combines local visual features, global visual features, attention mechanisms, and pre-trained models. Hindi image descriptions were generated by running Google Cloud Translator on the Flickr8k image dataset. Pre-trained CNNs such as VGG16, ResNet50, and Inception V3 were used to extract image features, while the uni-directional…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques