CLIP4Caption: CLIP for Video Caption

Mingkang Tang; Zhanyu Wang; Zhenhua Liu; Fengyun Rao; Dian Li; Xiu Li

arXiv:2110.06615·cs.CV·October 14, 2021

TL;DR

This paper introduces CLIP4Caption, a video captioning framework that uses CLIP to learn video features better aligned with text and decodes captions with a Transformer decoder plus an ensemble strategy; it reports state-of-the-art results.

Contribution

The paper proposes a CLIP-enhanced video-text matching network and a Transformer-based decoder, along with a novel ensemble strategy, to improve video captioning performance.

Findings

01. Achieved up to a 10% improvement in CIDEr score on the MSR-VTT dataset.

02. Secured 2nd place in the ACM MM 2021 Video Understanding Challenge.

03. Reported state-of-the-art results on two datasets.

Abstract

Video captioning is a challenging task since it requires generating sentences that describe diverse and complex videos. Existing video captioning models lack adequate visual representations because they neglect the gap between videos and texts. To bridge this gap, we propose CLIP4Caption, a framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM). The framework takes full advantage of information from both vision and language and forces the model to learn strongly text-correlated video features for text generation. In addition, unlike most existing models, which use an LSTM or GRU as the sentence decoder, we adopt a Transformer-structured decoder network to effectively learn long-range visual and language dependencies. We also introduce a novel ensemble strategy for captioning tasks. Experimental…
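The video-text matching idea in the abstract can be illustrated with a small sketch. The abstract does not specify the training objective, so everything here is an assumption for illustration: frame embeddings are mean-pooled into one video vector, videos and captions are compared by cosine similarity, and a CLIP-style symmetric contrastive loss pulls matched pairs together. The function names, the pooling choice, and the temperature value are hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pool(frames):
    """Average per-frame embeddings into one video-level embedding (assumed pooling)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def similarity_matrix(video_embs, text_embs):
    """Cosine similarity between every video (rows) and every caption (columns)."""
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_matching_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video contrastive losses.

    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    sim = similarity_matrix(video_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-video
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: two videos, each with two frame embeddings, and two captions.
videos = [mean_pool([[1.0, 0.0], [0.8, 0.2]]),
          mean_pool([[0.0, 1.0], [0.1, 0.9]])]
captions = [[1.0, 0.1], [0.0, 1.0]]

aligned_loss = symmetric_matching_loss(videos, captions)
shuffled_loss = symmetric_matching_loss(videos, captions[::-1])
```

In this toy batch the loss is lower when each video is paired with its own caption than when the captions are swapped, which is the behavior a matching objective of this kind is meant to reward.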

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Adam · Label Smoothing · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Softmax