Less is More: Label-Guided Summarization of Procedural and Instructional Videos
Shreya Rajpal, Michal Golovanevsky, Carsten Eickhoff

TL;DR
This paper introduces PRISM, a three-stage framework that produces concise, semantically grounded video summaries by combining adaptive sampling, label-driven keyframe selection, and large language model (LLM) validation. It retains 84% of semantic content while sampling fewer than 5% of frames.
Contribution
The paper presents a three-stage summarization framework that integrates semantic understanding and multimodal analysis to improve the quality and coherence of summaries for procedural and instructional videos.
Findings
Retains 84% of semantic content while sampling fewer than 5% of frames.
Outperforms baseline methods by up to 33% in summary quality.
Generalizes effectively across procedural and instructional videos.
Abstract
Video summarization helps turn long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains like surgical training. Prior work has progressed from using basic visual features like color, motion, and structural changes to using pre-trained vision-language models that can better understand what's happening in the video (semantics) and capture temporal flow, resulting in more context-aware video summarization. We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis, that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful and procedural transitions while filtering out…
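The three stages named in the abstract (adaptive sampling, label-driven keyframe anchoring, and contextual validation) can be pictured with a minimal Python sketch. This is not the authors' implementation: the scalar per-frame features, the change threshold, the `score` function, and the `is_valid` callback standing in for the LLM are all hypothetical placeholders chosen only to show how the stages might chain together.

```python
from typing import Callable, Dict, List, Sequence


def adaptive_sample(features: Sequence[float], threshold: float) -> List[int]:
    """Stage 1 (assumed form): keep frame 0, then every frame whose feature
    differs from the last kept frame by more than `threshold`."""
    if not features:
        return []
    kept = [0]
    for i in range(1, len(features)):
        if abs(features[i] - features[kept[-1]]) > threshold:
            kept.append(i)
    return kept


def anchor_to_labels(
    candidates: Sequence[int],
    labels: Sequence[str],
    score: Callable[[int, str], float],
) -> Dict[str, int]:
    """Stage 2 (assumed form): for each label, pick the best-scoring candidate."""
    return {label: max(candidates, key=lambda i: score(i, label)) for label in labels}


def validate(anchors: Dict[str, int], is_valid: Callable[[int, str], bool]) -> List[int]:
    """Stage 3 (assumed form): keep anchors the validator accepts.
    `is_valid` is a stand-in for a call to an LLM."""
    return sorted({i for label, i in anchors.items() if is_valid(i, label)})


def summarize(
    features: Sequence[float],
    labels: Sequence[str],
    score: Callable[[int, str], float],
    is_valid: Callable[[int, str], bool],
    threshold: float = 0.5,
) -> List[int]:
    """Chain the three stages and return the selected frame indices."""
    candidates = adaptive_sample(features, threshold)
    if not candidates:
        return []
    return validate(anchor_to_labels(candidates, labels, score), is_valid)


# Toy data: one scalar "feature" per frame and a made-up target value per label.
features = [0.0, 0.1, 0.9, 1.0, 1.05, 2.0, 2.1, 3.0]
targets = {"cut": 1.0, "stitch": 3.0}


def toy_score(i: int, label: str) -> float:
    # Higher is better: negative distance between frame feature and label target.
    return -abs(features[i] - targets[label])


keyframes = summarize(features, list(targets), toy_score, lambda i, label: True)
print(keyframes)
```

In a real system the scalar features would be replaced by visual embeddings, `toy_score` by a vision-language similarity, and the always-true validator by an actual LLM query; the sketch only fixes the order in which the stages run.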
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
