Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

Yiwu Zhong; Licheng Yu; Yang Bai; Shangwen Li; Xueting Yan; Yin Li

arXiv:2303.17839 · cs.CV · April 3, 2023 · 1 citation

TL;DR

This paper introduces a method for learning procedure-aware video representations from instructional videos and their narrations. The learned representation captures both action steps and their temporal order without human annotations, which improves step recognition and procedure reasoning.

Contribution

The paper proposes a joint learning framework that pairs a video representation encoding individual step concepts with a deep probabilistic model of step ordering, covering both temporal dependencies and variation between individual videos.
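As a rough illustration of what "probabilistic modeling of temporal dependencies" means, the sketch below fits a first-order transition table over step labels and uses it to rank likely next steps. This is a deliberately simple stand-in, not the paper's deep probabilistic model. The function names (`fit_transitions`, `forecast`), the smoothing constant, and the toy step sequences are all hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences, smoothing=1.0):
    """Estimate P(next_step | current_step) with add-k smoothing.

    Illustrative first-order Markov model; the paper's actual model is
    a deep probabilistic model and is NOT reproduced here.
    """
    vocab = sorted({step for seq in sequences for step in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur in vocab:
        total = sum(counts[cur].values()) + smoothing * len(vocab)
        probs[cur] = {
            nxt: (counts[cur][nxt] + smoothing) / total for nxt in vocab
        }
    return probs


def forecast(probs, current_step, k=2):
    """Return the k most probable next steps (ties broken alphabetically)."""
    ranked = sorted(probs[current_step].items(), key=lambda kv: (-kv[1], kv[0]))
    return [step for step, _ in ranked[:k]]


# Hypothetical step sequences; the same task appears with different orderings.
toy_sequences = [
    ["crack egg", "whisk", "pour", "flip"],
    ["crack egg", "whisk", "heat pan", "pour", "flip"],
    ["heat pan", "crack egg", "whisk", "pour", "flip"],
]

transitions = fit_transitions(toy_sequences)
top_next = forecast(transitions, "whisk", k=2)
print(top_next)
```

Keeping a full distribution over next steps, rather than a single prediction, is what lets a model of this kind account for the fact that different videos of the same task order their steps differently.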

Findings

01. Improves step classification accuracy on COIN and EPIC-Kitchens by +2.8% and +3.3%, respectively, over prior state-of-the-art results.

02. Significantly improves step forecasting performance.

03. Enables effective zero-shot step inference and diverse step prediction.
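One common way to perform zero-shot step inference is to compare a video clip embedding against text embeddings of candidate step descriptions and pick the closest match. The sketch below shows that idea with cosine similarity. It assumes, without confirmation from this page, that video and text live in a shared embedding space. The function `zero_shot_step`, the step names, and the three-dimensional vectors are hypothetical placeholders rather than outputs of the released model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_step(video_embedding, step_text_embeddings):
    """Pick the step whose text embedding is closest to the video embedding.

    Assumes a shared video-text embedding space (an assumption made for
    this sketch, not a detail taken from the page).
    """
    scores = {
        name: cosine(video_embedding, emb)
        for name, emb in step_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for encoder outputs.
step_text_embeddings = {
    "whisk eggs": [0.9, 0.1, 0.0],
    "pour batter": [0.1, 0.9, 0.1],
    "flip pancake": [0.0, 0.2, 0.9],
}
clip_embedding = [0.2, 0.8, 0.2]

predicted, scores = zero_shot_step(clip_embedding, step_text_embeddings)
print(predicted)
```

Because the candidate steps are supplied as text at inference time, new step labels can be added without retraining, which is what makes the setting zero-shot.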

Abstract

The abundance of instructional videos and their narrations over the Internet offers an exciting avenue for understanding procedural activities. In this work, we propose to learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations, without using human annotations. Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering. We empirically demonstrate that learning temporal ordering not only enables new capabilities for procedure reasoning, but also reinforces the recognition of individual steps. Our model significantly advances the state-of-the-art results on step classification (+2.8% / +3.3% on COIN / EPIC-Kitchens) and step…

Code & Models

Repositories

facebookresearch/procedurevrl (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications