TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness
Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

TL;DR
This paper presents a novel self-supervised video summarization framework that captures spatial and temporal dependencies efficiently without attention mechanisms, achieving state-of-the-art results on benchmark datasets.
Contribution
It introduces a self-supervised model using Markov process-driven loss metrics and a two-stage learning paradigm, eliminating the need for supervision or attention-based models.
Findings
Achieves state-of-the-art performance on the SumMe and TVSum datasets.
Outperforms existing unsupervised methods and rivals supervised models.
Demonstrates efficiency and generalizability in video summarization.
Abstract
The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights to a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SumMe and TVSum datasets,…
