VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, Lianwen Jin

TL;DR
VideoCLIP-XL enhances video-language models' ability to understand and rank long, detailed descriptions by introducing new datasets, training techniques, and evaluation benchmarks, addressing limitations of existing CLIP models.
Contribution
The paper presents VideoCLIP-XL, a novel model with new training data, tasks, and benchmarks to improve long description understanding in video CLIP models.
Findings
Improved performance on long-description video retrieval tasks.
Effective long-description understanding demonstrated on the new LVDR benchmark.
Better learning of the feature-space distribution via TPCM.
Abstract
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset of VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding the long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and…
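For readers unfamiliar with the contrastive objective that CLIP-style models build on, the following is a minimal sketch of a symmetric video-text contrastive loss in plain Python. It is not the VideoCLIP-XL training code: TPCM and the ranking tasks are not shown because this page does not describe them in enough detail. The function names, the toy embeddings, and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over a video-text similarity matrix.

    Row i of video_embs is assumed to be paired with row i of text_embs.
    The temperature is a hypothetical fixed value for this sketch.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # Cosine similarity between every video and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]

    # Video-to-text direction: the matching text sits on the diagonal.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-video direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2v = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n

    return 0.5 * (v2t + t2v)


# Toy two-dimensional embeddings (illustrative only).
videos = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = clip_contrastive_loss(videos, texts_matched)
loss_swapped = clip_contrastive_loss(videos, texts_swapped)
```

With correctly paired embeddings the loss is close to zero, while swapping the pairs makes it large, which is the pressure that pulls each video toward its own description during pre-training.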
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Power Systems and Technologies
Methods: Contrastive Language-Image Pre-training
