Video Question Answering on Screencast Tutorials

Wentian Zhao; Seokhwan Kim; Ning Xu; Hailin Jin

arXiv:2008.00544·cs.CL·August 4, 2020

TL;DR

This paper introduces a new video question answering task for screencast tutorials, together with a dataset whose answers are grounded in a domain knowledge base and baseline neural models that use multi-modal context and domain knowledge to improve performance.

Contribution

It presents a new dataset and baseline models for video question answering on screencast tutorials, emphasizing domain knowledge grounding and multi-modal context integration.

Findings

01 Models with multi-modal context outperform baselines.

02 Incorporating domain knowledge improves accuracy.

03 Proposed algorithms effectively extract visual cues.

Abstract

This paper presents a new video question answering task on screencast tutorials. We introduce a dataset of question, answer, and context triples drawn from the tutorial videos for a software product. Unlike other video question answering work, all the answers in our dataset are grounded in a domain knowledge base. A one-shot recognition algorithm is designed to extract visual cues, which helps enhance the performance of video question answering. We also propose several baseline neural network architectures based on various aspects of the video contexts in the dataset. The experimental results demonstrate that our proposed models significantly improve question answering performance by incorporating multi-modal contexts and domain knowledge.
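To make the idea of knowledge-base-grounded answers concrete, here is a minimal sketch in which answering a question means ranking knowledge-base entries against the question and its video context. It is not the paper's model: the token-overlap scoring stands in for the neural architectures, and the names `rank_kb_entries`, `tokenize`, and the toy knowledge base are hypothetical, chosen only for illustration.

```python
def tokenize(text):
    """Lowercase whitespace tokenization (an assumption made for this sketch)."""
    return set(text.lower().split())


def rank_kb_entries(question, context, kb):
    """Rank knowledge-base entry ids by token overlap with question + context.

    `kb` maps a hypothetical entry id to a short text description.
    Overlap counting is a stand-in for a learned scoring model.
    """
    query = tokenize(question) | tokenize(context)
    scores = {
        entry_id: len(query & tokenize(description))
        for entry_id, description in kb.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda entry_id: (-scores[entry_id], entry_id))


# Toy knowledge base, invented for illustration only.
kb = {
    "crop_tool": "crop tool trims the edges of an image",
    "brush_tool": "brush tool paints strokes on the canvas",
}

ranking = rank_kb_entries(
    question="how do i trim the edges of my photo",
    context="select the crop tool",  # e.g. transcript text near the question
    kb=kb,
)
print(ranking[0])
```

Because every candidate answer is a knowledge-base entry, the task in this sketch reduces to ranking over a closed set rather than generating free text.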
