CLIP4Caption: CLIP for Video Caption
Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li

TL;DR
This paper introduces CLIP4Caption, a video captioning framework that uses a CLIP-enhanced video-text matching network to learn text-correlated video features, a Transformer decoder in place of an LSTM or GRU, and an ensemble strategy, achieving state-of-the-art results on MSR-VTT.
Contribution
The paper proposes a CLIP-enhanced video-text matching network and a Transformer-based decoder, along with a novel ensemble strategy, to improve video captioning performance.
Findings
Achieved up to a 10% improvement in CIDEr score on the MSR-VTT dataset.
Secured 2nd place in the ACM MM 2021 Pre-training for Video Understanding Challenge.
Demonstrated effectiveness with state-of-the-art results on two datasets.
Abstract
Video captioning is a challenging task since it requires generating sentences describing various diverse and complex videos. Existing video captioning models lack adequate visual representation due to the neglect of the existence of gaps between videos and texts. To bridge this gap, in this paper, we propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM). This framework is taking full advantage of the information from both vision and language and enforcing the model to learn strongly text-correlated video features for text generation. Besides, unlike most existing models using LSTM or GRU as the sentence decoder, we adopt a Transformer structured decoder network to effectively learn the long-range visual and language dependency. Additionally, we introduce a novel ensemble strategy for captioning tasks. Experimental…
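The video-text matching idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a CLIP-style symmetric contrastive objective over cosine similarities, mean pooling of per-frame embeddings, and a temperature of 0.07, none of which are stated in the text above. All function names (mean_pool, similarity_matrix, matching_loss) are hypothetical, and the embeddings are toy values rather than real CLIP outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_pool(frames):
    """Average per-frame embeddings into one video embedding (assumed pooling)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def similarity_matrix(video_embs, text_embs):
    """Cosine similarity between every video and every caption embedding."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def matching_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric video-to-text and text-to-video contrastive loss (assumed form)."""
    sim = similarity_matrix(video_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-video
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two videos (two frames each) and their two captions.
videos = [
    mean_pool([[1.0, 0.0], [0.8, 0.2]]),
    mean_pool([[0.0, 1.0], [0.1, 0.9]]),
]
texts = [[1.0, 0.1], [0.1, 1.0]]

aligned = matching_loss(videos, texts)
shuffled = matching_loss(videos, texts[::-1])
print(aligned < shuffled)  # correctly paired captions should give the lower loss
```

Under these assumptions, minimizing the loss pulls each video embedding toward its own caption and away from the other captions in the batch, which is one way to obtain the "strongly text-correlated video features" the abstract describes.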
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Adam · Label Smoothing · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Softmax
