Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

Yuanli Wu; Long Zhang; Yue Du; Bin Li

arXiv:2510.17501 · cs.CV · October 23, 2025

Open Access

TL;DR

This paper introduces a zero-shot video summarization method that uses rubric-guided pseudo-labels and prompt-driven reasoning with large language models, achieving competitive results without training.

Contribution

It presents a novel framework combining pseudo-labeling and structured rubrics to enable stable, interpretable, and training-free zero-shot video summarization.

Findings

01 Achieves an F1 score of 57.58 on SumMe

02 Surpasses zero-shot baselines by +0.85 on SumMe

03 Demonstrates effectiveness across three benchmarks

Abstract

We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on…
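
The context rule described in the abstract (boundary scenes scored from their own descriptions, intermediate scenes scored with summaries of their neighbours) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Scene` type, the `build_scoring_prompt` function, the prompt wording, and the example rubric text are all hypothetical, and the call to a language model is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    description: str  # textual description of this scene
    summary: str      # concise summary, used as context for neighbouring scenes


def build_scoring_prompt(scenes: List[Scene], i: int, rubric: str) -> str:
    """Assemble a scoring prompt for scene i (hypothetical prompt layout)."""
    scene = scenes[i]
    parts = [f"Rubric:\n{rubric}", f"Scene description:\n{scene.description}"]

    # Opening and closing scenes are treated as boundary scenes:
    # they are scored from their own description only.
    is_boundary = i == 0 or i == len(scenes) - 1
    if not is_boundary:
        # Intermediate scenes also see summaries of the adjacent scenes so the
        # model can judge narrative continuity and redundancy.
        parts.append(f"Previous scene summary:\n{scenes[i - 1].summary}")
        parts.append(f"Next scene summary:\n{scenes[i + 1].summary}")
        parts.append(
            "Consider narrative continuity and redundancy with the adjacent scenes."
        )

    parts.append("Return a single importance score.")
    return "\n\n".join(parts)


# Example rubric text built from the dimensions named in the abstract (assumed wording).
RUBRIC = "Thematic relevance; action detail; narrative progression."

scenes = [
    Scene("A cyclist prepares a bike at the trailhead.", "Cyclist prepares."),
    Scene("The cyclist descends a rocky trail.", "Rocky descent."),
    Scene("The cyclist rests at the bottom.", "Cyclist rests."),
]
prompts = [build_scoring_prompt(scenes, i, RUBRIC) for i in range(len(scenes))]
```

In this sketch only `prompts[1]` carries neighbour summaries; the first and last prompts contain the rubric and the scene's own description alone. Each prompt would then be sent to the language model, whose returned scores drive summary selection.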

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis