How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos

Shaojie Wang; Wentian Zhao; Ziyi Kou; Chenliang Xu

arXiv:1812.00344 · cs.CV · December 7, 2018 · 5 cites

TL;DR

This paper introduces a new dataset and models for understanding long, structured instructional videos through question-answering, emphasizing temporal reasoning and multimodal information to improve comprehension accuracy.

Contribution

The paper presents YouQuek, a new QA dataset for instructional videos, and proposes a Recurrent Graph Convolutional Network (RGCN) to improve reasoning over long videos.

Findings

01. RGCN outperforms other models in QA accuracy
02. Adding human-annotated descriptions improves understanding
03. Multimodal approaches boost video comprehension

Abstract

Understanding web instructional videos is an essential branch of video understanding in two respects. First, most existing video methods focus on short-term actions in video clips only a few seconds long; these methods are not directly applicable to long videos. Second, unlike unconstrained long videos, e.g., movies, instructional videos are more structured in that they follow a step-by-step procedure that constrains the understanding task. In this paper, we study reasoning on instructional videos via question-answering (QA). Surprisingly, it has not been an emphasis in the video community despite its rich applications. We therefore introduce YouQuek, an annotated QA dataset for instructional videos based on the recent YouCook2. The questions in YouQuek are not limited to cues in a single frame but involve logical reasoning in the temporal dimension. Observing the lack of effective representations for…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Relational Graph Convolution Network