TRIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning
Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

TL;DR
TRIMMER is a self-supervised reinforcement learning framework that improves video summarization by capturing temporal dynamics and semantic diversity efficiently, outperforming existing unsupervised methods.
Contribution
TRIMMER introduces a novel entropy-based reward mechanism and a two-stage learning process for scalable, domain-agnostic video summarization.
Findings
Achieves state-of-the-art results among unsupervised methods
Maintains competitive performance with supervised approaches
Enhances efficiency through direct reward computation over frames
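This summary does not spell out how TRIMMER's entropy-based reward is defined, so the following is only an illustrative sketch of what a reward computed directly over selected frames could look like. It is not the authors' formulation. The function names (`shannon_entropy`, `coverage_reward`), the segment-based binning, and the `num_segments` parameter are all assumptions introduced for illustration: the reward is the normalized Shannon entropy of how the selected frame indices spread across equal-length temporal segments.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in nats) of a distribution given as non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def coverage_reward(selected, num_frames, num_segments=4):
    """Hypothetical reward in [0, 1], not TRIMMER's actual reward.

    Splits the video into `num_segments` equal-length temporal segments and
    returns the normalized entropy of the selected frames over those segments.
    A summary spread evenly over the video scores near 1; a summary clustered
    in one segment scores 0.
    """
    if not selected or num_segments < 2:
        return 0.0
    # Map each selected frame index to the segment that contains it.
    hits = Counter(
        min(i * num_segments // num_frames, num_segments - 1) for i in selected
    )
    counts = [hits.get(s, 0) for s in range(num_segments)]
    return shannon_entropy(counts) / math.log(num_segments)


if __name__ == "__main__":
    print(coverage_reward([0, 25, 50, 75], num_frames=100))  # evenly spread
    print(coverage_reward([0, 1, 2, 3], num_frames=100))     # clustered
```

Because the score depends only on frame indices, it can be evaluated without decoding or re-encoding the video, which is one plausible reading of "direct reward computation over frames". A practical reward would likely also include a semantic term over frame features.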
Abstract
The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust…
