Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
Yuanli Wu, Long Zhang, Yue Du, Bin Li

TL;DR
This paper introduces a zero-shot video summarization method that uses rubric-guided pseudo-labels and prompt-driven reasoning with large language models, achieving competitive results without training.
Contribution
It presents a novel framework combining pseudo-labeling and structured rubrics to enable stable, interpretable, and training-free zero-shot video summarization.
Findings
Achieves an F1 score of 57.58 on SumMe
Surpasses zero-shot baselines by +0.85 on SumMe
Demonstrates effectiveness across three benchmarks
Abstract
We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on…
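The inference-time context rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `build_scene_prompts`, the prompt wording, and the 1-to-10 scale are all assumptions made for illustration; only the three rubric dimensions and the boundary/intermediate distinction come from the abstract. The sketch builds prompt strings only and does not call a language model.

```python
from typing import List

# Rubric dimensions named in the abstract; the real rubrics are dataset-adaptive.
RUBRIC = ("thematic relevance", "action detail", "narrative progression")


def build_scene_prompts(descriptions: List[str], summaries: List[str]) -> List[str]:
    """Build one scoring prompt per scene (hypothetical helper).

    Boundary scenes (first and last) use only their own description.
    Intermediate scenes also receive concise summaries of both neighbors.
    """
    if len(descriptions) != len(summaries):
        raise ValueError("descriptions and summaries must have the same length")

    last = len(descriptions) - 1
    prompts = []
    for i, desc in enumerate(descriptions):
        lines = [
            # The 1-10 scale is an assumption, not taken from the paper.
            "Score this scene from 1 to 10 on: " + ", ".join(RUBRIC) + ".",
            f"Scene description: {desc}",
        ]
        if 0 < i < last:  # intermediate scene: add adjacent-segment context
            lines.append(f"Previous scene summary: {summaries[i - 1]}")
            lines.append(f"Next scene summary: {summaries[i + 1]}")
            lines.append("Also weigh narrative continuity and redundancy with neighbors.")
        prompts.append("\n".join(lines))
    return prompts


if __name__ == "__main__":
    demo = build_scene_prompts(
        ["A man unpacks a bike.", "He rides downhill.", "He waves at the camera."],
        ["unpacking", "riding", "waving"],
    )
    print(demo[1])
```

Keeping boundary prompts free of neighbor context mirrors the abstract's statement that opening and closing segments are scored independently, while only intermediate scenes are judged for continuity and redundancy.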
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
