Event-Guided Procedure Planning from Instructional Videos with Text Supervision

An-Lan Wang; Kun-Yu Lin; Jia-Run Du; Jingke Meng; Wei-Shi Zheng

arXiv:2308.08885·cs.CV·August 21, 2023

TL;DR

This paper introduces an event-guided approach to procedure planning from instructional videos with text supervision. It bridges the semantic gap between observed visual states and unobserved actions by first inferring the event from the states and then planning actions based on both the states and the inferred event.

Contribution

The paper proposes a novel event-guided paradigm and the E3P model, which combines event inference with relation mining to improve procedure planning from instructional videos.

Findings

01. E3P outperforms previous methods on three datasets.
02. Event inference improves planning accuracy.
03. Mask-and-predict enhances relation modeling.

Abstract

In this work, we focus on the task of procedure planning from instructional videos with text supervision, where a model aims to predict an action sequence that transforms the initial visual state into the goal visual state. A critical challenge of this task is the large semantic gap between the observed visual states and the unobserved intermediate actions, which previous works ignore. Specifically, the contents of the observed visual states are semantically different from the elements of some action text labels in a procedure. To bridge this semantic gap, we propose a novel event-guided paradigm, which first infers events from the observed states and then plans out actions based on both the states and the predicted events. Our inspiration comes from the observation that planning a procedure from an instructional video is to complete a specific event and a specific event…
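
The two-stage idea described in the abstract (infer an event from the observed states, then plan actions using both the states and the event) can be illustrated with a toy sketch. This is not the E3P model: the feature vectors, event names, action names, and the nearest-neighbour planner below are all hypothetical stand-ins chosen only to make the control flow concrete.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical toy data (not from the paper): one prototype vector per event,
# and one feature vector per action. The third dimension loosely encodes
# "progress" through the procedure.
EVENT_PROTOTYPES: Dict[str, Vector] = {
    "make_coffee": [1.0, 0.0, 0.0],
    "change_tire": [0.0, 1.0, 0.0],
}
ACTION_FEATURES: Dict[str, Dict[str, Vector]] = {
    "make_coffee": {
        "grind beans": [1.0, 0.0, 0.0],
        "pour water": [1.0, 0.0, 0.5],
        "serve coffee": [1.0, 0.0, 1.0],
    },
    "change_tire": {
        "jack up car": [0.0, 1.0, 0.0],
        "swap wheel": [0.0, 1.0, 0.5],
        "lower car": [0.0, 1.0, 1.0],
    },
}


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def infer_event(start: Vector, goal: Vector) -> str:
    """Stage 1: pick the event whose prototype best matches the pooled states."""
    pooled = [(s + g) / 2 for s, g in zip(start, goal)]
    return max(EVENT_PROTOTYPES, key=lambda e: dot(pooled, EVENT_PROTOTYPES[e]))


def plan_actions(start: Vector, goal: Vector, event: str, horizon: int) -> List[str]:
    """Stage 2: plan with both the states and the inferred event.

    The event restricts the candidate actions; the states define a target
    point for each step by linear interpolation from start to goal.
    """
    candidates = ACTION_FEATURES[event]
    steps: List[str] = []
    for t in range(horizon):
        alpha = t / max(horizon - 1, 1)  # 0.0 at the first step, 1.0 at the last
        target = [s + alpha * (g - s) for s, g in zip(start, goal)]
        steps.append(min(candidates, key=lambda a: sq_dist(candidates[a], target)))
    return steps


def plan(start: Vector, goal: Vector, horizon: int = 3) -> Tuple[str, List[str]]:
    event = infer_event(start, goal)
    return event, plan_actions(start, goal, event, horizon)


if __name__ == "__main__":
    print(plan([1.0, 0.0, 0.0], [1.0, 0.0, 1.0]))
```

In this sketch the event acts as a filter on which actions are even considered, which is one simple way to read "event-guided"; a learned model would replace both the prototype matching and the interpolation heuristic.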

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Human Pose and Action Recognition