TL;DR
This paper introduces ClipSum, a multimodal video summarization framework built on frozen CLIP features, which are semantically aligned with language. On instructional videos it outperforms traditional CNN-based methods.
Contribution
The work demonstrates that leveraging frozen CLIP vision-language features with explicit temporal modeling improves instructional video summarization.
Findings
ClipSum achieves 33.0% ROUGE-1 on YouCook2, versus 30.5% for ResNet-152 features, while using 4x lower feature dimensionality (512 vs. 2048).
Frozen CLIP features outperform fine-tuned CLIP, emphasizing the importance of semantic alignment.
Semantic alignment of visual features with language enhances summarization quality.
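For readers unfamiliar with the metric behind these numbers, ROUGE-1 measures unigram overlap between a generated summary and a reference. The sketch below is a minimal illustration, not the paper's evaluation code: the whitespace tokenization, the lowercasing, and the `rouge1` helper name are assumptions, and the summary does not state whether the reported score is recall or F1.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1).

    Assumption: lowercase + whitespace tokenization; real evaluation
    toolkits typically add stemming and more careful tokenization.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical cooking-instruction example (not taken from YouCook2).
scores = rouge1("add the chopped onions", "add chopped onions to the pan")
```

Here all four candidate unigrams appear in the six-word reference, so precision is 1.0, recall is 4/6, and F1 is 0.8.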
Abstract
Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%)…
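The abstract names three ingredients: frozen 512-dimensional CLIP frame features, explicit temporal modeling, and dimension-adaptive fusion. It does not say how the last two are implemented, so the following is only a rough sketch under stated assumptions: sinusoidal position encodings stand in for temporal modeling, a single linear projection stands in for the fusion step, random vectors stand in for real CLIP features, and the 768-dimensional decoder width is hypothetical.

```python
import math
import random


def sinusoidal_positions(num_frames: int, dim: int) -> list:
    """Standard sinusoidal position encodings, one row per frame.

    Assumption: a stand-in for the paper's temporal modeling, whose
    actual form is not described in the abstract.
    """
    rows = []
    for t in range(num_frames):
        row = []
        for i in range(dim):
            angle = t / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        rows.append(row)
    return rows


def add_positions(features: list) -> list:
    """Add position encodings so frame order is visible downstream."""
    pos = sinusoidal_positions(len(features), len(features[0]))
    return [[x + p for x, p in zip(f, pr)] for f, pr in zip(features, pos)]


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> list:
    """Randomly initialised in_dim x out_dim matrix (would be learned)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(out_dim)]
            for _ in range(in_dim)]


def project(features: list, weight: list) -> list:
    """Map each frame vector to the decoder width (assumed fusion step)."""
    out_dim = len(weight[0])
    return [[sum(x * weight[i][j] for i, x in enumerate(frame))
             for j in range(out_dim)]
            for frame in features]


CLIP_DIM = 512      # from the abstract
DECODER_DIM = 768   # hypothetical text-decoder hidden size
NUM_FRAMES = 4      # hypothetical number of sampled frames

rng = random.Random(1)
# Random placeholders for frozen CLIP frame features (never updated).
frame_features = [[rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)]
                  for _ in range(NUM_FRAMES)]

fused = project(add_positions(frame_features),
                make_projection(CLIP_DIM, DECODER_DIM))
```

In this sketch only the projection matrix would be trained, and `frame_features` would be precomputed once and reused, which is what "frozen" implies. The result `fused` has one 768-dimensional vector per frame, ready to be consumed by a text decoder.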
