Less is More: Label-Guided Summarization of Procedural and Instructional Videos

Shreya Rajpal; Michal Golovanevsky; Carsten Eickhoff

arXiv:2601.12243 · cs.CV · February 2, 2026

Open Access

TL;DR

This paper introduces PRISM, a three-stage framework that produces concise, semantically grounded video summaries by combining adaptive sampling, label-driven keyframe selection, and large language model validation. It retains 84% of semantic content while sampling less than 5% of frames.

Contribution

The paper presents a three-stage summarization framework that integrates semantic understanding with multimodal analysis to improve the quality and coherence of summaries for procedural and instructional videos.

Findings

01. Retains 84% of semantic content with less than 5% of frames sampled.

02. Outperforms baseline methods by up to 33% in summary quality.

03. Generalizes effectively across procedural and instructional videos.

Abstract

Video summarization turns long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains such as surgical training. Prior work has progressed from basic visual features such as color, motion, and structural changes to pre-trained vision-language models that better capture a video's semantics and temporal flow, yielding more context-aware summaries. We propose PRISM (Procedural Representation via Integrated Semantic and Multimodal analysis), a three-stage framework that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful and procedural transitions while filtering out…
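To make the three stages named in the abstract concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how PRISM samples, anchors, or validates frames. The function names (`adaptive_sample`, `anchor_to_labels`, `summarize`), the mean-absolute-difference criterion, the `threshold` parameter, and the `validate` callback that stands in for the LLM check are all hypothetical.

```python
from typing import Callable, List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def adaptive_sample(frames: List[Sequence[float]], threshold: float) -> List[int]:
    """Stage 1 (assumed criterion): keep a frame only when it differs enough
    from the most recently kept frame."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


def anchor_to_labels(kept: List[int], labels: List[str]) -> List[int]:
    """Stage 2 (assumed rule): among sampled frames, keep the first frame of
    each run of identical step labels."""
    anchors: List[int] = []
    previous = None
    for i in kept:
        if labels[i] != previous:
            anchors.append(i)
            previous = labels[i]
    return anchors


def summarize(
    frames: List[Sequence[float]],
    labels: List[str],
    threshold: float = 0.2,
    validate: Callable[[int, str], bool] = lambda index, label: True,
) -> List[int]:
    """Run all three stages. `validate` is a hypothetical stand-in for the
    LLM-based contextual check; the default accepts every candidate."""
    anchors = anchor_to_labels(adaptive_sample(frames, threshold), labels)
    return [i for i in anchors if validate(i, labels[i])]


# Toy input: one scalar feature per frame and one step label per frame.
toy_frames = [[0.0], [0.05], [0.5], [0.55], [1.0], [1.0]]
toy_labels = ["cut", "cut", "cut", "stir", "stir", "stir"]
summary = summarize(toy_frames, toy_labels)
fraction_kept = len(summary) / len(toy_frames)
```

On this toy input the sampler keeps frames 0, 2, and 4, and label anchoring reduces them to one frame per step. In a real system the feature vectors would come from a visual encoder and `validate` would call a language model; both are outside the scope of this sketch.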

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization