Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models

Maham Nazir; Muhammad Aqeel; Richong Zhang; Francesco Setti

arXiv:2605.11959 · cs.CV · May 13, 2026

TL;DR

This paper introduces ClipSum, a multimodal video summarization framework built on frozen CLIP features, which are semantically aligned with language. On the instructional-video benchmark YouCook2 it reaches 33.0% ROUGE-1, compared with 30.5% for a ResNet-152 (CNN) baseline.

Contribution

The work shows that combining frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion improves instructional video summarization over CNN features trained for object classification.

Findings

01

ClipSum achieves 33.0% ROUGE-1 on YouCook2, compared with 30.5% for ResNet-152 features.

02

Frozen CLIP features outperform fine-tuned CLIP, emphasizing the importance of semantic alignment.

03

Semantic alignment of visual features with language matters more than feature capacity: the 512-dimensional CLIP features outperform the 2048-dimensional ResNet-152 features.
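
The findings above are reported in ROUGE-1, which scores unigram overlap between a generated summary and a reference. The snippet below is a minimal sketch of that metric, assuming lowercase whitespace tokenization; the function name `rouge1` and the example sentences are illustrative, and the paper's actual evaluation tooling (stemming, tokenizer, which of precision, recall, or F1 is reported) is not stated on this page.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1).

    Assumption: tokens are lowercase whitespace-separated words.
    Standard ROUGE toolkits may also apply stemming and other normalization.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up sentences, not drawn from YouCook2.
scores = rouge1(
    "add the chopped onions to the pan",
    "add chopped onions and garlic to the pan",
)
print(scores)
```

Here six of the seven candidate words are matched against the eight reference words, so precision is 6/7 and recall is 6/8.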

Abstract

Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%)…
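
The abstract names three ingredients (frozen visual features, explicit temporal modeling, dimension-adaptive fusion) without giving their design. The sketch below is only a loose illustration of how such a pipeline could be wired, not the authors' implementation. It assumes sinusoidal position encodings as the temporal signal and a single linear projection as the dimension adapter; the decoder width of 768, the random stand-in features, and all function names are hypothetical.

```python
import math
import random
from typing import List

CLIP_DIM = 512      # CLIP feature width quoted in the abstract
DECODER_DIM = 768   # hypothetical text-decoder hidden size (assumption)


def sinusoidal_position(t: int, dim: int) -> List[float]:
    """Sinusoidal encoding of frame index t (assumed temporal signal)."""
    enc = []
    for i in range(dim):
        angle = t / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def add_temporal_order(frames: List[List[float]]) -> List[List[float]]:
    """Add a position encoding to each frame feature so order is visible."""
    return [
        [x + p for x, p in zip(frame, sinusoidal_position(t, len(frame)))]
        for t, frame in enumerate(frames)
    ]


def project(frames: List[List[float]], weight: List[List[float]]) -> List[List[float]]:
    """Linear map to the decoder width; weight has shape out_dim x in_dim."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]


random.seed(0)
# Stand-ins for precomputed frame features; a frozen encoder is never updated,
# so in practice these would be extracted once and cached.
frames = [[random.gauss(0, 1) for _ in range(CLIP_DIM)] for _ in range(4)]
# Only the adapter would be trainable in this sketch.
weight = [[random.gauss(0, CLIP_DIM ** -0.5) for _ in range(CLIP_DIM)]
          for _ in range(DECODER_DIM)]

visual_tokens = project(add_temporal_order(frames), weight)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 768
```

Under these assumptions, swapping in a 2048-dimensional feature extractor would change only `CLIP_DIM` and the adapter's input width, which is one plausible reading of "dimension-adaptive".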

Code & Models

Repositories

aqeeelmirza/clipsum (GitHub)
