TL;DR
This paper introduces a video captioning model that enhances semantic feature extraction, employs scheduled sampling for better training, and adjusts loss functions to generate more comprehensive captions, leading to improved performance on standard datasets.
Contribution
The paper presents a novel semantic detection network, applies scheduled sampling for training, and modifies the loss function to produce longer, more accurate video captions.
Findings
Outperforms previous models on the YouTube2Text dataset.
Achieves competitive results on the MSR-VTT dataset.
Improves the quality of semantic features and produces longer captions.
Abstract
Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely applied to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often utilized to optimize video captioning models, but during training and inference, different strategies are applied to guide word generation, leading to poor performance. Third, current video captioning models are prone to generate relatively short captions that express video contents inappropriately. Toward resolving these three problems, we suggest three corresponding improvements. First of all, we propose a metric to compare the quality of semantic features, and utilize…
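The mismatch described in the second limitation (ground-truth words guide the decoder during training, while its own predictions guide it at inference) is what scheduled sampling targets. A minimal sketch of the general idea follows. It is not the paper's implementation: the inverse-sigmoid decay, the constant `k`, and the helper names `teacher_forcing_prob`, `choose_next_input`, `build_decoder_inputs` and `predict_fn` are all assumptions made for illustration, since this summary does not state the schedule the authors use.

```python
import math
import random


def teacher_forcing_prob(step, k=100.0):
    """Probability of feeding the ground-truth word at a given training step.

    Assumed inverse-sigmoid decay: close to 1 early in training, falling
    towards 0 later. The exponent is clamped to avoid float overflow.
    """
    return k / (k + math.exp(min(step / k, 700.0)))


def choose_next_input(gold_token, predicted_token, p_teacher, rng):
    """Feed the gold token with probability p_teacher, else the model's own guess."""
    return gold_token if rng.random() < p_teacher else predicted_token


def build_decoder_inputs(gold, predict_fn, p_teacher, rng):
    """Build the decoder inputs for one training caption.

    gold       : gold input tokens, starting with the start-of-sentence token
    predict_fn : hypothetical stand-in for the model's prediction of the next
                 word given the previous decoder input
    """
    inputs = [gold[0]]  # the start token is always taken from the ground truth
    for t in range(1, len(gold)):
        predicted = predict_fn(inputs[-1])
        inputs.append(choose_next_input(gold[t], predicted, p_teacher, rng))
    return inputs


if __name__ == "__main__":
    rng = random.Random(0)
    gold = ["<bos>", "a", "man", "is", "cooking"]
    # Toy predictor (assumption): always predicts the placeholder word "<pred>".
    mixed = build_decoder_inputs(gold, lambda prev: "<pred>",
                                 teacher_forcing_prob(step=300), rng)
    print(mixed)
```

With `p_teacher = 1.0` this reduces to plain Teacher Forcing, and with `p_teacher = 0.0` the decoder sees only its own predictions after the start token, which matches the inference-time condition.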
