Nepali Video Captioning using CNN-RNN Architecture
Bipesh Subedi, Saugat Singh, Bal Krishna Bal

TL;DR
This study develops a deep learning approach to Nepali video captioning that pairs pre-trained CNN encoders with RNN decoders. Its best model reaches a BLEU-4 score of 17 and a METEOR score of 46, providing a baseline for future research on video understanding in low-resource languages.
Contribution
It introduces a Nepali video captioning dataset, built by adding Google Translate-generated Nepali captions to MSVD, and evaluates multiple CNN-RNN architectures to identify the most effective combination for this language.
Findings
EfficientNetB0 + BiLSTM with 1024 hidden dimensions was the best model, achieving a BLEU-4 score of 17 and a METEOR score of 46
The MSVD dataset was enriched with Nepali captions via Google Translate
Challenges and future directions for Nepali video captioning are identified
Abstract
This article presents a study on Nepali video captioning using deep neural networks. Through the integration of pre-trained CNNs and RNNs, the research focuses on generating precise and contextually relevant captions for Nepali videos. The approach involves dataset collection, data preprocessing, model implementation, and evaluation. By enriching the MSVD dataset with Nepali captions via Google Translate, the study trains various CNN-RNN architectures. The research explores the effectiveness of CNNs (e.g., EfficientNetB0, ResNet101, VGG16) paired with different RNN decoders like LSTM, GRU, and BiLSTM. Evaluation involves BLEU and METEOR metrics, with the best model being EfficientNetB0 + BiLSTM with 1024 hidden dimensions, achieving a BLEU-4 score of 17 and METEOR score of 46. The article also outlines challenges and future directions for advancing Nepali video captioning, offering a…
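Since the abstract reports results as BLEU-4, a minimal sketch of how a sentence-level BLEU-4 score can be computed may help in reading that number. This is an illustration under stated assumptions, not the authors' evaluation code: the function names `ngram_counts` and `bleu4` are hypothetical, no smoothing is applied, tokenization is plain whitespace splitting, and the example captions are English placeholders rather than data from the paper. The sketch returns a value in [0, 1]; that the reported 17 corresponds to 0.17 on this scale is an assumption.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n grams.

    candidate: list of tokens; references: list of token lists.
    Returns a score in [0, 1] (hypothetical helper, not the paper's code).
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Placeholder captions (assumed for illustration only).
reference = "a man is playing guitar".split()
candidate = "a man is playing a guitar".split()
score = bleu4(candidate, [reference])
print(f"BLEU-4: {score:.4f}")
```

Corpus-level BLEU, which is what captioning papers typically report, aggregates the clipped counts and lengths over all test captions before taking the geometric mean, so it is not simply the average of per-sentence scores like the one above.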
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Digital Imaging for Blood Diseases
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Gated Recurrent Unit · Bidirectional LSTM
