CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

Bang Yang, Tong Zhang, and Yuexian Zou

arXiv:2111.15162 · cs.CV · August 23, 2022 · 1 citation

TL;DR

This paper investigates the impact of CLIP pre-training on video captioning, revealing its advantages over traditional ImageNet pre-training in capturing concepts and proposing a new auxiliary task to enhance concept-aware representations.

Contribution

The paper demonstrates the benefits of CLIP over ImageNet Pre-training (INP) for video captioning and introduces Dual Concept Detection (DCD), a novel auxiliary task for concept-aware representation learning.

Findings

1. CLIP-based models outperform INP-based models in caption quality.

2. INP-based models struggle to capture concept semantics and are sensitive to irrelevant background information.

3. DCD improves captioning performance and concept learning.

Abstract

For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, then a task-oriented network is fine-tuned from scratch to cope with caption generation. This paper first investigates the impact of the recently proposed CLIP (Contrastive Language-Image Pre-training) on video captioning. Through the empirical study on INP vs. CLIP, we identify the potential deficiencies of INP and explore the key factors for accurate description generation. The results show that the INP-based model struggles to capture concepts' semantics and is sensitive to irrelevant background information. By contrast, the CLIP-based model significantly improves the caption quality and highlights the importance of concept-aware representation learning. With these findings, we propose Dual Concept Detection (DCD) further to…
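
The abstract is cut off before it describes how DCD works, so the snippet below does not reproduce the authors' method. It is only a minimal sketch of the general idea of an auxiliary concept-detection task: a captioning loss is combined with a multi-label concept classification loss. The function names, the binary cross-entropy formulation, and the `aux_weight` parameter are all assumptions made for illustration.

```python
import math


def concept_detection_loss(logits, targets):
    """Mean binary cross-entropy over a concept vocabulary (multi-label).

    Hypothetical helper: `logits` are raw scores, one per concept, and
    `targets` are 0/1 flags marking which concepts appear in the
    ground-truth captions. Uses the numerically stable logit form
    max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def caption_loss(token_log_probs):
    """Mean negative log-likelihood of the ground-truth caption tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def total_loss(token_log_probs, concept_logits, concept_targets, aux_weight=1.0):
    """Captioning loss plus a weighted auxiliary concept-detection loss.

    `aux_weight` is an assumed hyperparameter, not a value from the paper.
    """
    aux = concept_detection_loss(concept_logits, concept_targets)
    return caption_loss(token_log_probs) + aux_weight * aux


if __name__ == "__main__":
    # Two caption tokens, each predicted with probability 0.5, and two
    # concepts with uninformative (zero) logits.
    log_probs = [math.log(0.5), math.log(0.5)]
    print(total_loss(log_probs, [0.0, 0.0], [1, 0], aux_weight=0.5))
```

Setting `aux_weight` to zero recovers plain caption training, which makes it easy to compare a model with and without the auxiliary signal under this assumed setup.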

Code & Models

Repositories

yangbang18/CLIP-Captioner (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training