CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
Bang Yang, Tong Zhang, and Yuexian Zou

TL;DR
This paper investigates the impact of CLIP pre-training on video captioning, revealing its advantages over traditional ImageNet pre-training in capturing concepts and proposing a new auxiliary task to enhance concept-aware representations.
Contribution
The paper demonstrates the benefits of CLIP over ImageNet Pre-training (INP) for video captioning and introduces Dual Concept Detection (DCD), a novel auxiliary task for learning concept-aware representations.
Findings
CLIP-based models outperform INP-based models in caption quality.
INP-based models struggle to capture concept semantics and are sensitive to irrelevant background information.
DCD improves captioning performance and concept learning.
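This summary does not spell out how DCD is formulated, so the following is only a generic sketch of how a concept-detection auxiliary task can be attached to a captioning objective. It is not the authors' method. It assumes concepts are encoded as a multi-hot vector and scored with binary cross-entropy; the names `binary_cross_entropy`, `joint_loss`, and `aux_weight` are hypothetical.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted concept probabilities
    and a multi-hot target vector (1 = concept present, 0 = absent)."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Clamp to avoid log(0) on saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def joint_loss(caption_loss, concept_probs, concept_targets, aux_weight=0.5):
    """Captioning loss plus a weighted auxiliary concept-detection loss.

    aux_weight is an assumed hyperparameter, not a value from the paper.
    """
    aux = binary_cross_entropy(concept_probs, concept_targets)
    return caption_loss + aux_weight * aux
```

With `aux_weight=0` the objective reduces to plain captioning, which makes it easy to compare training runs with and without the auxiliary term.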
Abstract
For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, then a task-oriented network is fine-tuned from scratch to cope with caption generation. This paper first investigates the impact of the recently proposed CLIP (Contrastive Language-Image Pre-training) on video captioning. Through an empirical study of INP vs. CLIP, we identify the potential deficiencies of INP and explore the key factors for accurate description generation. The results show that the INP-based model struggles to capture concepts' semantics and is sensitive to irrelevant background information. By contrast, the CLIP-based model significantly improves the caption quality and highlights the importance of concept-aware representation learning. With these findings, we propose Dual Concept Detection (DCD) further to…
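For readers unfamiliar with the contrastive objective behind CLIP, here is a minimal pure-Python sketch of a symmetric image-text contrastive loss: matched pairs sit on the diagonal of a cosine-similarity matrix, and cross-entropy is applied in both directions. The function names and the fixed `temperature=0.07` are illustrative assumptions (CLIP itself learns the temperature), and this is not the training code used in the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target_index):
    """Softmax cross-entropy for one row, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair k is (image_embs[k], text_embs[k]); every other pairing in the
    batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should pick its own column.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick its own row.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = sum(cross_entropy_row(cols[k], k) for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Because the supervision comes from paired text rather than a fixed label set, an objective of this kind ties visual features to language, which is consistent with the abstract's emphasis on concept-aware representations.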
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Cancer-related molecular mechanisms research
Methods: Contrastive Language-Image Pre-training
