Delving Deeper into the Decoder for Video Captioning

Haoran Chen; Jianmin Li; Xiaolin Hu

arXiv:2001.05614 · cs.CV · February 15, 2021 · 19 cites

TL;DR

This paper investigates the decoder in video captioning models and proposes three techniques that improve performance on the MSVD and MSR-VTT benchmarks: variational dropout combined with layer normalization, an online method for evaluating checkpoints on the validation set, and a training strategy called professional learning.

Contribution

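The paper makes a thorough investigation of the decoder in video captioning models and adopts three techniques: a recurrent unit with variational dropout and layer normalization to alleviate overfitting, an online validation method for selecting the best checkpoint for testing, and a professional learning strategy that uses the strengths of a captioning model and bypasses its weaknesses.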

Findings

01

Achieved top performance on MSVD and MSR-VTT datasets.

02

Significant improvements in BLEU, CIDEr, METEOR, and ROUGE-L metrics.

03

Gains of up to 18% on MSVD and 3.5% on MSR-VTT over the previous state-of-the-art models.

Abstract

Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence. The encoder-decoder framework is the most popular paradigm for this task in recent years. However, there exist some problems in the decoder of a video captioning model. We make a thorough investigation into the decoder and adopt three techniques to improve the performance of the model. First of all, a combination of variational dropout and layer normalization is embedded into a recurrent unit to alleviate the problem of overfitting. Secondly, a new online method is proposed to evaluate the performance of a model on a validation set so as to select the best checkpoint for testing. Finally, a new training strategy called professional learning is proposed which uses the strengths of a captioning model and bypasses its weaknesses. It is demonstrated in the experiments on…
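
The first technique in the abstract, variational dropout combined with layer normalization inside a recurrent unit, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the official repository uses TensorFlow). The function names, the parameter layout, the choice of an LSTM cell, and the decision to normalize all four gate pre-activations jointly are assumptions made for illustration only. The one property the sketch is meant to show is that the dropout masks are sampled once per sequence and reused at every time step, rather than resampled per step.

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each row of x to zero mean and unit variance, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_mask(rng, shape, keep_prob):
    """Inverted-dropout mask: 0 with probability 1 - keep_prob, else 1 / keep_prob."""
    return rng.binomial(1, keep_prob, size=shape) / keep_prob


def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical parameter layout: the four LSTM gates are stacked along the last axis."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(0.0, 0.1, (input_dim, 4 * hidden_dim)),
        "U": rng.normal(0.0, 0.1, (hidden_dim, 4 * hidden_dim)),
        "gain": np.ones(4 * hidden_dim),
        "bias": np.zeros(4 * hidden_dim),
    }


def run_lstm(xs, params, keep_prob=0.5, train=True, seed=0):
    """Run an LSTM over xs of shape (T, B, D); return hidden states of shape (T, B, H)."""
    W, U = params["W"], params["U"]
    steps, batch, input_dim = xs.shape
    hidden_dim = U.shape[0]
    rng = np.random.default_rng(seed)

    # Variational dropout: sample the masks ONCE per sequence and reuse them
    # at every time step. At evaluation time no units are dropped.
    if train:
        mask_x = sample_mask(rng, (batch, input_dim), keep_prob)
        mask_h = sample_mask(rng, (batch, hidden_dim), keep_prob)
    else:
        mask_x = np.ones((batch, input_dim))
        mask_h = np.ones((batch, hidden_dim))

    h = np.zeros((batch, hidden_dim))
    c = np.zeros((batch, hidden_dim))
    outputs = []
    for t in range(steps):
        # Gate pre-activations from the masked input and masked previous state.
        pre = (xs[t] * mask_x) @ W + (h * mask_h) @ U
        # Layer normalization over the stacked pre-activations (an assumption;
        # normalizing each gate separately is another common choice).
        pre = layer_norm(pre, params["gain"], params["bias"])
        i, f, o, g = np.split(pre, 4, axis=-1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)
```

Calling `run_lstm(xs, init_params(D, H), train=True)` applies the same pair of masks across the whole sequence, while `train=False` disables dropout so the output no longer depends on `seed`.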

Code & Models

Repositories

WingsBrokenAngel/delving-deeper-into-the-decoder-for-video-captioning
tf · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Variational Dropout · Dropout · Layer Normalization