Learning to Ground Instructional Articles in Videos through Narrations
Effrosyni Mavroudi, Triantafyllos Afouras, Lorenzo Torresani

TL;DR
This paper introduces a method for automatically localizing steps in narrated instructional videos by leveraging large-scale instructional text, multimodal alignment, and iterative pseudo-label refinement, validated on a new benchmark.
Contribution
The paper presents an approach for temporally grounding procedural steps in narrated videos without manual supervision, by matching frames, narrations, and step descriptions, together with a new evaluation benchmark.
Findings
Significant improvements over baselines in step localization accuracy.
Effective multimodal alignment of frames, narrations, and step descriptions.
State-of-the-art performance on a narration-to-video alignment benchmark.
Abstract
In this paper we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities: frames, narrations, and step descriptions. Specifically, our method aligns steps to video by fusing information from two distinct pathways: i) direct alignment of step descriptions to frames, ii) indirect alignment obtained by composing steps-to-narrations with narrations-to-video correspondences. Notably, our approach performs global temporal grounding of all steps in an article at once by exploiting order…
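The two-pathway idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`compose_indirect`, `fuse`, `best_frame_per_step`), the softmax normalization, the mixing weight `alpha`, and the toy scores are all assumptions made for illustration. The sketch also scores each step independently, so it does not model the global grounding of all steps at once that the abstract mentions.

```python
import math


def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def compose_indirect(step_to_narr, narr_to_frame):
    """Indirect pathway (assumed form): steps -> narrations -> frames.

    step_to_narr:  S x N similarity scores (steps vs. narrations)
    narr_to_frame: N x T alignment scores (narrations vs. frames)
    Returns an S x T matrix of step-to-frame scores.
    """
    weights = [softmax(row) for row in step_to_narr]
    n_frames = len(narr_to_frame[0])
    return [
        [sum(w[n] * narr_to_frame[n][t] for n in range(len(w)))
         for t in range(n_frames)]
        for w in weights
    ]


def fuse(direct, indirect, alpha=0.5):
    """Blend the direct and indirect S x T score matrices (alpha is hypothetical)."""
    return [
        [alpha * d + (1.0 - alpha) * i for d, i in zip(d_row, i_row)]
        for d_row, i_row in zip(direct, indirect)
    ]


def best_frame_per_step(scores):
    """Pick the highest-scoring frame index for each step."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy example: 2 steps, 2 narrations, 3 frames (made-up numbers).
step_to_narr = [[10.0, 0.0], [0.0, 10.0]]
narr_to_frame = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
direct = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uninformative direct pathway

indirect = compose_indirect(step_to_narr, narr_to_frame)
fused = fuse(direct, indirect)
grounding = best_frame_per_step(fused)
```

In this toy case the direct pathway carries no signal, so the fused scores are driven entirely by the narration-mediated pathway. Each step is assigned the frame that its best-matching narration aligns to.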
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
Methods: Test · Balanced Selection
