Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization

Omer Tariq; Syed Muhammad Raza; Jeongbae Son

arXiv:2605.09507 · cs.CV · May 12, 2026

TL;DR

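This paper introduces VASTSum, a single-pass video summarization framework that models annotation uncertainty and aligns learning with the discrete decoding procedures used during evaluation (temporal segmentation and knapsack-based selection), with the aim of improving robustness and efficiency over existing methods.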

Contribution

VASTSum is presented as the first single-pass, uncertainty-aware, decoder-aligned learning framework for video summarization that explicitly models annotation subjectivity and stabilizes summary selection.

Findings

01

Reports consistent Kendall and Spearman rank correlations on the SumMe and TVSum benchmarks.

02

Demonstrates improved robustness to annotation disagreement.

03

Maintains efficient single-forward inference.
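
For readers unfamiliar with the rank-correlation metrics named in the first finding, here is a minimal sketch of how Kendall and Spearman coefficients compare a predicted importance ranking with an annotated one. It is an illustration only, not the paper's evaluation code: the function names are hypothetical, the Kendall variant is tau-a, and the Spearman shortcut formula assumes there are no tied scores.

```python
from itertools import combinations
from typing import List, Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes len >= 2."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def _ranks(values: Sequence[float]) -> List[int]:
    """Rank positions (0 = smallest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho via the squared rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(_ranks(x), _ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical predicted vs. annotated frame importance scores.
predicted = [0.1, 0.4, 0.3, 0.8]
annotated = [0.2, 0.5, 0.6, 0.9]
tau = kendall_tau(predicted, annotated)
rho = spearman_rho(predicted, annotated)
```

Both coefficients range from -1 (reversed ranking) to 1 (identical ranking). Real annotation scores often contain ties, so practical evaluations usually rely on a library implementation with tie handling rather than a sketch like this.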

Abstract

Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from…
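
The knapsack-based selection mentioned in the abstract can be sketched as a standard 0/1 knapsack over temporal segments: choose the segments with the highest total importance whose combined length fits a summary budget. This is a generic illustration under stated assumptions, not VASTSum's decoder; `knapsack_select`, the per-segment scores, the segment lengths, and the budget value are all hypothetical.

```python
from typing import List, Sequence


def knapsack_select(
    scores: Sequence[float], lengths: Sequence[int], budget: int
) -> List[int]:
    """Return indices of segments maximizing total score with total length <= budget."""
    n = len(scores)
    # best[i][c]: best total score using the first i segments within capacity c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip segment i-1
            if length <= c and best[i - 1][c - length] + score > best[i][c]:
                best[i][c] = best[i - 1][c - length] + score  # option 2: take it
    # Backtrack through the table to recover which segments were taken.
    chosen: List[int] = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


# Hypothetical segment-level scores and lengths (in frames).
segment_scores = [0.9, 0.2, 0.7, 0.4]
segment_lengths = [40, 30, 50, 20]
selected = knapsack_select(segment_scores, segment_lengths, budget=90)
```

Because this selection step is discrete, small changes in predicted scores can flip which segments are chosen, which is the kind of mismatch between training and evaluation that the abstract describes as motivating decoder-aligned learning.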
