Prompts to Summaries: Zero-Shot Language-Guided Video Summarization with Large Language and Video Models
Mario Barbara, Alaa Maalouf

TL;DR
This paper presents Prompts-to-Summaries, a zero-shot, user-guided video summarization method built on large language and video models. It outperforms prior unsupervised methods without using any training data.
Contribution
It introduces a zero-shot framework that turns captions from off-the-shelf video-language models into user-guided summaries by having a large language model judge scene importance, with no training required.
Findings
Surpasses all prior unsupervised methods on SumMe and TVSum datasets.
Performs competitively on the Query-Focused Video Summarization benchmark.
Provides a new dataset, VidSum-Reason, for long-tailed, multi-step reasoning video queries.
Abstract
The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that operate without training data. Existing methods either rely on domain-specific datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without the use of training data, beating unsupervised and matching supervised methods. Our pipeline (i) segments video into scenes, (ii) produces scene descriptions with a memory-efficient batch prompting scheme that scales to hours on a single GPU, (iii) scores scene importance with an LLM via tailored prompts, and (iv) propagates scores to frames using new consistency (temporal…
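The data flow of steps (iii) and (iv) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `keyword_overlap_score` is a toy stand-in for the LLM judge, `score_scenes` and `propagate_to_frames` are hypothetical helpers, and the propagation simply copies each scene's score to its frames. The paper's own frame-level metrics are cut off in the abstract above and are not reproduced.

```python
from typing import Callable, List, Tuple

# A scene is a half-open frame range [start, end). Hypothetical representation.
Scene = Tuple[int, int]


def keyword_overlap_score(caption: str, query: str) -> float:
    """Toy stand-in for the LLM judge (an assumption, not the paper's prompt):
    the fraction of query words that appear in the scene caption."""
    query_words = set(query.lower().split())
    caption_words = set(caption.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & caption_words) / len(query_words)


def score_scenes(
    captions: List[str],
    query: str,
    score_fn: Callable[[str, str], float] = keyword_overlap_score,
) -> List[float]:
    """Step (iii), sketched: one importance score per scene caption."""
    return [score_fn(caption, query) for caption in captions]


def propagate_to_frames(
    scenes: List[Scene], scene_scores: List[float], num_frames: int
) -> List[float]:
    """Step (iv), simplified: every frame inherits its scene's score.
    Frames not covered by any scene keep a score of 0.0."""
    frame_scores = [0.0] * num_frames
    for (start, end), score in zip(scenes, scene_scores):
        for frame in range(max(start, 0), min(end, num_frames)):
            frame_scores[frame] = score
    return frame_scores


if __name__ == "__main__":
    scenes = [(0, 3), (3, 5)]
    captions = ["a dog runs", "a car parks"]
    scene_scores = score_scenes(captions, query="dog")
    print(propagate_to_frames(scenes, scene_scores, num_frames=5))
```

In a real system the `score_fn` argument would wrap a call to an LLM with the scene caption and the user query, which is why it is passed in as a callable rather than hard-coded.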
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
