How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos
Shaojie Wang, Wentian Zhao, Ziyi Kou, Chenliang Xu

TL;DR
This paper introduces a new dataset and models for understanding long, structured instructional videos through question-answering, emphasizing temporal reasoning and multimodal information to improve comprehension accuracy.
Contribution
The paper presents YouQuek, a QA dataset for instructional videos built on YouCook2, and proposes a Recurrent Graph Convolutional Network (RGCN) to improve reasoning over long videos.
Findings
RGCN outperforms other models in QA accuracy
Adding human-annotated descriptions improves understanding
Multimodal approaches boost video comprehension
Abstract
Understanding web instructional videos is an essential branch of video understanding in two aspects. First, most existing video methods focus on short-term actions in video clips only a few seconds long; these methods are not directly applicable to long videos. Second, unlike unconstrained long videos, e.g., movies, instructional videos are more structured in that they have a step-by-step procedure constraining the understanding task. In this paper, we study reasoning on instructional videos via question-answering (QA). Surprisingly, it has not been an emphasis in the video community despite its rich applications. We thereby introduce YouQuek, an annotated QA dataset for instructional videos based on the recent YouCook2. The questions in YouQuek are not limited to cues on one frame but related to logical reasoning in the temporal dimension. Observing the lack of effective representations for…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Relational Graph Convolution Network
