Encoder-Decoder Based Long Short-Term Memory (LSTM) Model for Video Captioning
Sikiru Adewale, Tosin Ige, Bolanle Hafiz Matti

TL;DR
This paper presents an encoder-decoder LSTM model for video captioning that maps a sequence of video frames to a descriptive sentence. Captions are evaluated with 2-gram BLEU scores and are shown to generalize across video actions, even when the scene changes.
Contribution
The paper implements an encoder-decoder LSTM framework for video captioning, covering data preprocessing, model construction, and model training, and discusses architecture changes intended to improve caption grammar and correctness.
Findings
The model achieves reasonable 2-gram BLEU scores across the dataset splits.
Predicted captions generalize over different video actions.
Captions remain consistent with the action even when the video scene changes dramatically.
Abstract
This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions. The mapping takes a temporal sequence of video frames as input and produces a sequence of words that forms a caption sentence as output. Data preprocessing, model construction, and model training are discussed. Caption correctness is evaluated using 2-gram BLEU scores across the different splits of the dataset. Specific examples of output captions are shown to demonstrate model generality over the video temporal dimension. Predicted captions are shown to generalize over video action, even in instances where the video scene changes dramatically. Model architecture changes are discussed to improve sentence grammar and correctness.
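The abstract names 2-gram BLEU as the evaluation metric but does not say which variant was used (sentence-level or corpus-level, with or without smoothing). As a rough illustration only, the sketch below computes an unsmoothed sentence-level BLEU with uniform weights over 1-grams and 2-grams, following the standard clipped-precision and brevity-penalty definition. The function names `ngrams`, `clipped_precision`, and `bleu2` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the maximum count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu2(candidate, references):
    """Unsmoothed sentence-level BLEU with n-gram orders 1 and 2.

    Assumption: uniform weights (0.5, 0.5) and no smoothing; the paper
    may have used a different configuration.
    """
    if not candidate or not references:
        return 0.0
    p1 = clipped_precision(candidate, references, 1)
    p2 = clipped_precision(candidate, references, 2)
    if p1 == 0.0 or p2 == 0.0:
        return 0.0  # the geometric mean is zero if either precision is zero
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))


if __name__ == "__main__":
    reference = "a man is playing a guitar".split()
    predicted = "a man plays guitar".split()
    print(bleu2(predicted, [reference]))
```

Because the score is unsmoothed, any caption that shares no bigram with the reference scores zero, which is one reason short predicted captions can receive low BLEU even when they describe the right action.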
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
