Delving Deeper into the Decoder for Video Captioning
Haoran Chen, Jianmin Li, Xiaolin Hu

TL;DR
This paper investigates the decoder in video captioning models and proposes three techniques that improve performance on the MSVD and MSR-VTT benchmarks: variational dropout combined with layer normalization, an online method for evaluating checkpoints on the validation set, and a training strategy called professional learning.
Contribution
The paper analyzes the decoder of encoder-decoder video captioning models and proposes techniques that reduce overfitting in the recurrent unit, improve checkpoint selection, and adjust the training strategy, leading to state-of-the-art results.
Findings
Best reported results on the MSVD and MSR-VTT datasets.
Improvements across the BLEU, CIDEr, METEOR, and ROUGE-L metrics.
Gains of up to 18% on MSVD and 3.5% on MSR-VTT over previous state-of-the-art models.
Abstract
Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence. The encoder-decoder framework is the most popular paradigm for this task in recent years. However, there exist some problems in the decoder of a video captioning model. We make a thorough investigation into the decoder and adopt three techniques to improve the performance of the model. First of all, a combination of variational dropout and layer normalization is embedded into a recurrent unit to alleviate the problem of overfitting. Secondly, a new online method is proposed to evaluate the performance of a model on a validation set so as to select the best checkpoint for testing. Finally, a new training strategy called professional learning is proposed which uses the strengths of a captioning model and bypasses its weaknesses. It is demonstrated in the experiments on…
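The first technique in the abstract, variational dropout combined with layer normalization inside a recurrent unit, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain tanh recurrent unit instead of the gated unit a captioning decoder would normally use, omits the learned gain and bias of layer normalization, and uses hypothetical names (`layer_norm`, `sample_mask`, `run_sequence`) and random weights. The point it shows is that the dropout masks are sampled once per sequence and reused at every time step, while the pre-activations are normalized across the hidden dimension before the nonlinearity.

```python
import math
import random


def layer_norm(values, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def sample_mask(size, drop_prob, rng):
    """Sample one inverted-dropout mask; kept units are scaled by 1 / (1 - p)."""
    keep = 1.0 - drop_prob
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def run_sequence(inputs, w_in, w_rec, drop_prob, rng):
    """Run a tanh recurrent unit over one sequence (illustrative only).

    The masks are sampled once here and reused at every time step, which is
    what distinguishes variational dropout from per-step dropout.
    """
    hidden_size = len(w_rec)
    input_mask = sample_mask(len(inputs[0]), drop_prob, rng)
    hidden_mask = sample_mask(hidden_size, drop_prob, rng)
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        x_dropped = [m * v for m, v in zip(input_mask, x)]
        h_dropped = [m * v for m, v in zip(hidden_mask, h)]
        pre = [a + b for a, b in zip(matvec(w_in, x_dropped),
                                     matvec(w_rec, h_dropped))]
        # Layer normalization is applied to the summed pre-activations.
        h = [math.tanh(p) for p in layer_norm(pre)]
        states.append(h)
    return states, input_mask, hidden_mask


# Toy usage with random weights and inputs (all sizes are arbitrary).
rng = random.Random(0)
input_size, hidden_size, steps = 4, 3, 5
w_in = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(hidden_size)]
w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
inputs = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(steps)]
states, input_mask, hidden_mask = run_sequence(inputs, w_in, w_rec, 0.5, rng)
```

At test time the masks would normally be disabled, which corresponds to calling `run_sequence` with `drop_prob=0.0` in this sketch.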
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Variational Dropout · Dropout · Layer Normalization
