Masked Autoencoder for Unsupervised Video Summarization
Minho Shim, Taeoh Kim, Jinhyung Kim, Dongyoon Wee

TL;DR
This paper shows that an autoencoder trained with self-supervised learning can perform dense video summarization directly, using its reconstruction scores to rate each frame, without additional architecture modifications or fine-tuning.
Contribution
It demonstrates that a self-supervised autoencoder can be directly used for video summarization by utilizing reconstruction scores, eliminating the need for extra downstream design.
Findings
Effective in major unsupervised video summarization benchmarks
No additional architecture or fine-tuning required
Utilizes reconstruction scores for importance estimation
Abstract
Summarizing a video requires a diverse understanding of the video, ranging from recognizing scenes to judging whether each frame is essential enough to be included in the summary. Self-supervised learning (SSL) is acknowledged for its robustness and flexibility across multiple downstream tasks, but video SSL has not yet shown its value for dense understanding tasks like video summarization. We claim that an unsupervised autoencoder with sufficient self-supervised learning does not need any extra downstream architecture design or weight fine-tuning to be used as a video summarization model. The proposed method estimates the importance score of each frame from the reconstruction score of the autoencoder's decoder. We evaluate the method on major unsupervised video summarization benchmarks to show its effectiveness under various experimental settings.
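The idea of scoring frames by reconstruction can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the model, the masking scheme, or how a reconstruction score maps to importance. Here `toy_reconstruct` is a hypothetical stand-in for a trained masked autoencoder (it predicts a masked frame as the mean of the visible frames), and the rule "higher reconstruction error means higher importance" is an assumption made only for illustration. The names `reconstruction_scores`, `select_summary`, and `budget` are likewise hypothetical.

```python
import numpy as np


def toy_reconstruct(features, masked_idx):
    """Hypothetical stand-in for a trained masked autoencoder.

    Predicts the masked frame as the mean of all visible frames.
    A real model would be a pretrained encoder-decoder network.
    """
    visible = np.delete(features, masked_idx, axis=0)
    return visible.mean(axis=0)


def reconstruction_scores(features, reconstruct=toy_reconstruct):
    """Mask each frame in turn and measure how badly it is reconstructed.

    features: array of shape (num_frames, feature_dim).
    Returns one mean-squared reconstruction error per frame.
    """
    scores = np.empty(len(features))
    for i in range(len(features)):
        predicted = reconstruct(features, i)
        scores[i] = np.mean((predicted - features[i]) ** 2)
    return scores


def select_summary(scores, budget=0.15):
    """Keep the top-scoring fraction of frames, returned in temporal order.

    ASSUMPTION: frames that are harder to reconstruct are treated as
    more important. The source does not state this mapping.
    """
    k = max(1, int(round(budget * len(scores))))
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)


if __name__ == "__main__":
    # Five toy frames: four identical, one that differs from its context.
    frames = np.zeros((5, 3))
    frames[2] = 1.0
    scores = reconstruction_scores(frames)
    print(scores)
    print(select_summary(scores, budget=0.2))
```

In this toy setting the frame that differs from its context receives the largest score, since it cannot be predicted from the remaining frames. With a real model, `reconstruct` would be replaced by a forward pass through the pretrained autoencoder, with no extra head or fine-tuning, which is the property the abstract emphasizes.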
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Natural Language Processing Techniques
