STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos
Anshul Shah, Benjamin Lundell, Harpreet Sawhney, Rama Chellappa

TL;DR
This paper introduces STEPs, a self-supervised method for extracting and localizing key steps in unlabeled procedural videos, leveraging multi-cue features and a novel contrastive loss to improve AR-based training applications.
Contribution
It presents a new self-supervised learning framework with BMC2 loss and techniques for training lightweight temporal modules using multiple cues, enhancing key step extraction without labels.
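To make the idea of a contrastive objective across cues concrete, here is a minimal sketch of a generic InfoNCE loss between two cue embeddings of the same time steps. It is not the BMC2 loss, whose exact form this summary does not give. The function name `info_nce`, the temperature value, and the synthetic "RGB" and "flow" features are all assumptions made for illustration.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """Generic InfoNCE loss between two cues (illustrative, not BMC2).

    anchor, positive: arrays of shape (T, D); row i of each is assumed
    to describe the same time step seen through a different cue.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                        # (T, T) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i is column i; all other columns act as negatives.
    return float(-np.mean(np.diag(log_prob)))

# Synthetic stand-ins for per-time-step features from two cues (hypothetical).
rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 16))
flow_aligned = rgb + 0.01 * rng.normal(size=(8, 16))   # agrees with rgb per step
flow_shuffled = flow_aligned[::-1].copy()              # temporal alignment broken

loss_aligned = info_nce(rgb, flow_aligned)
loss_shuffled = info_nce(rgb, flow_shuffled)
```

In this toy setup the loss is lower when the two cues agree at each time step than when their alignment is broken, which is the behavior a contrastive objective rewards.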
Findings
Significant improvements in key step localization accuracy
Effective use of multi-cue information like optical flow, depth, and gaze
Qualitative results show meaningful and succinct key step representations
Abstract
We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We propose a training objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn discriminative representations for various steps without any labels. Unlike prior works, we develop techniques to train a lightweight temporal module which uses off-the-shelf features for self-supervision. Our approach can seamlessly leverage information from multiple cues like optical flow, depth or gaze to learn discriminative features for key steps, making it amenable for AR applications. We finally extract key steps via a tunable algorithm that clusters the representations and samples. We show significant…
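The final stage described in the abstract (cluster the learned representations, then sample) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the clustering algorithm, so plain k-means with a deterministic farthest-point initialization is swapped in here, and the function names, the choice of k, and the synthetic features are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=50):
    """Plain k-means with farthest-point initialization (an assumed stand-in)."""
    centers = [x[0]]
    for _ in range(1, k):
        # Pick the point farthest from all centers chosen so far.
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    # Recompute labels so they match the final centers.
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

def extract_key_steps(features, k):
    """Return one representative frame index per cluster, in temporal order."""
    labels, centers = kmeans(features, k)
    key_frames = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        d = np.linalg.norm(features[members] - centers[j], axis=1)
        key_frames.append(int(members[d.argmin()]))  # frame nearest the centroid
    return sorted(key_frames), labels

# Synthetic per-frame features: three well-separated "steps" of 10 frames each.
rng = np.random.default_rng(0)
step_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.concatenate(
    [m + 0.1 * rng.normal(size=(10, 2)) for m in step_means]
)
key_frames, labels = extract_key_steps(features, k=3)
```

Here k plays the role of the tunable knob: a larger k yields a finer-grained set of candidate key steps, a smaller k a more succinct one.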
Taxonomy
Topics: Advanced Vision and Imaging
