TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

TL;DR
This paper introduces a novel method for summarizing instructional videos by leveraging task relevance and cross-modal saliency, using pseudo summaries for training and a new benchmark for evaluation.
Contribution
It proposes an automatic pseudo summary generation approach and a new summarization network tailored for instructional videos, improving over existing methods.
Findings
Outperforms baseline models on the WikiHow Summaries dataset.
Uses pseudo summaries for weak supervision effectively.
Demonstrates the importance of task relevance and cross-modal cues in summarization.
Abstract
YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In comparison to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task…
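Assumption (i) above can be illustrated with a small sketch. This is not the authors' procedure: it assumes each video has already been split into segments and that each segment has an embedding vector, and the function names, the max-then-mean aggregation, and the `keep_ratio` parameter are all hypothetical choices made for illustration. A segment scores highly when something similar to it shows up in the other videos of the same task.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def task_relevance_scores(videos):
    """Score segments by how often similar segments recur across videos.

    videos: list of (n_segments, dim) arrays, all from the same task.
    Returns one score array per video. The scoring rule (best cosine
    match in each other video, averaged over those videos) is an
    illustrative assumption, not the paper's method.
    """
    if len(videos) < 2:
        raise ValueError("need at least two videos of the same task")
    normed = [l2_normalize(np.asarray(v, dtype=float)) for v in videos]
    scores = []
    for i, segs in enumerate(normed):
        best_matches = []
        for j, other in enumerate(normed):
            if i == j:
                continue
            sim = segs @ other.T              # cosine similarity, shape (n_i, n_j)
            best_matches.append(sim.max(axis=1))
        scores.append(np.mean(best_matches, axis=0))
    return scores


def pseudo_summary(scores, keep_ratio=0.5):
    """Return indices of the top-scoring segments, in temporal order."""
    k = max(1, int(round(keep_ratio * len(scores))))
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)


# Toy example: three videos of one task with 2-D segment embeddings.
videos = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # second segment appears nowhere else
    np.array([[1.0, 0.0]]),
    np.array([[1.0, 0.1]]),
]
scores = task_relevance_scores(videos)
selected = pseudo_summary(scores[0], keep_ratio=0.5)
```

In the toy example the first segment of the first video has close matches in both other videos, so it is the one kept. The second assumption is elided in the abstract above and is not modeled here.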
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Test
