Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization
Omer Tariq, Syed Muhammad Raza, Jeongbae Son

TL;DR
This paper introduces VASTSum, a single-pass video summarization framework that models annotation uncertainty and aligns learning with the discrete decoding procedures used during evaluation, improving robustness and efficiency over existing methods.
Contribution
VASTSum is the first single-pass, uncertainty-aware, decoder-aligned learning framework for video summarization that explicitly models annotation subjectivity and stabilizes summary selection.
Findings
Achieves consistent Kendall's τ and Spearman's ρ rank correlations on SumMe and TVSum.
Demonstrates improved robustness to annotation disagreement.
Maintains efficient single-forward inference.
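Kendall and Spearman correlations of the kind listed in the findings are typically computed between predicted frame-level importance scores and human annotations. The following is a minimal sketch of both statistics in plain Python; the function names and the toy scores are illustrative assumptions, not code or data from the paper.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b via an O(n^2) pair count (fine for a sketch)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from every term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom

# Toy example (made-up numbers): predicted vs. human importance.
predicted = [1.0, 2.0, 3.0, 4.0]
human = [1.0, 3.0, 2.0, 4.0]
rho = spearman_rho(predicted, human)
tau = kendall_tau_b(predicted, human)
```

In the toy example one of the six frame pairs is ordered differently by the two score lists, so both coefficients fall below 1. On real benchmarks the statistic is usually averaged over annotators and videos; that aggregation is omitted here.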
Abstract
Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from…
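The knapsack-based selection mentioned in the abstract can be illustrated with a standard 0/1 knapsack over temporal segments. This is a sketch of the generic decoding step only, not of VASTSum itself. It assumes segments are given as half-open frame ranges, that a segment's value is its mean frame score, and that the summary budget is 15% of the video length (a common convention, stated here as an assumption); all names are hypothetical.

```python
def knapsack_select(values, weights, capacity):
    """0/1 knapsack by dynamic programming; returns chosen item indices."""
    n = len(values)
    # dp[i][c] = best total value using the first i items within capacity c.
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, w = values[i - 1], weights[i - 1]
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v
    # Backtrack: item i-1 was taken whenever the table value changed.
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= weights[i - 1]
    return sorted(selected)

def select_summary(frame_scores, segments, budget_ratio=0.15):
    """Pick segments maximizing total mean score under a length budget.

    frame_scores: per-frame importance scores.
    segments: list of (start, end) half-open frame ranges (assumed given,
              e.g. by a temporal segmentation step).
    budget_ratio: assumed fraction of the video allowed in the summary.
    """
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in segments]
    weights = [e - s for s, e in segments]
    capacity = int(budget_ratio * len(frame_scores))
    return knapsack_select(values, weights, capacity)

# Toy example (made-up numbers): four segments, budget of 50 frames.
picked = knapsack_select([0.9, 0.2, 0.7, 0.4], [30, 20, 40, 10], 50)

# Toy video of 100 frames split into four segments.
scores = [0.9] * 10 + [0.5] * 30 + [0.8] * 10 + [0.1] * 50
chosen = select_summary(scores, [(0, 10), (10, 40), (40, 50), (50, 100)])
```

Because the selection is a discrete argmax over subsets, small changes in the predicted scores can flip which segments are chosen, which is the kind of mismatch between score learning and decoding that the abstract points to.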
