Learning To Recognize Procedural Activities with Distant Supervision
Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

TL;DR
This paper introduces a method for classifying complex, multi-step activities in long videos by automatically identifying steps through distant supervision from a textual knowledge base, improving generalization across various tasks.
Contribution
The paper presents a novel approach that leverages distant supervision from wikiHow to automatically label steps in instructional videos, enabling training without manual annotations.
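As a rough illustration of automatic step labeling, the sketch below assigns each video segment the most similar step description from a knowledge-base article. It rests on assumptions not stated in this summary: each segment comes with a text transcript, and a simple bag-of-words cosine similarity stands in for whatever matching model the authors actually use. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a learned text embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pseudo_label_segments(segment_texts, step_descriptions):
    """Return, for each segment, the index of the best-matching step.

    No manual annotation is involved: the labels come only from
    comparing segment text with knowledge-base step text.
    """
    step_vecs = [bag_of_words(s) for s in step_descriptions]
    labels = []
    for text in segment_texts:
        seg_vec = bag_of_words(text)
        scores = [cosine(seg_vec, sv) for sv in step_vecs]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Hypothetical step descriptions (in the style of a how-to article)
# and hypothetical transcripts for two video segments.
steps = [
    "crack the eggs into a bowl",
    "whisk the eggs with milk",
    "pour the mixture into a hot pan",
]
segments = [
    "now i crack two eggs into this bowl",
    "pour it all into the hot pan",
]
labels = pseudo_label_segments(segments, steps)
```

The resulting step indices could then serve as noisy training targets for a video model. In practice a stronger text encoder would replace the bag-of-words similarity.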
Findings
Models trained with automatically-labeled steps outperform baselines.
The approach generalizes well to multiple downstream tasks.
Automatic step identification improves activity recognition accuracy.
Abstract
In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal dependencies. This problem is dramatically different from traditional action classification, where models are typically optimized on videos that span only a few seconds and that are manually trimmed to contain simple atomic actions. While step annotations could enable the training of models to recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels due to the prohibitive cost of manually annotating temporal boundaries…
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Music and Audio Processing
Methods: Balanced Selection
