DeepStory: Video Story QA by Deep Embedded Memory Networks

Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang

arXiv:1707.00836 · cs.CV · July 5, 2017 · 34 citations

Open Access

TL;DR

This paper introduces Deep Embedded Memory Networks (DEMN), a model that learns from cartoon videos to answer questions about video stories. It reconstructs stories from a joint scene-dialogue stream in a latent embedding space and uses an LSTM-based attention model to recall the relevant story content.

Contribution

The paper presents DEMN, a model that combines scene-dialogue story reconstruction with attention for video story QA. It outperforms other QA models on the Pororo dataset and achieves state-of-the-art results on the MovieQA benchmark.

Findings

01. DEMN outperforms other QA models on the Pororo dataset.

02. DEMN achieves state-of-the-art results on the MovieQA benchmark.

03. Story reconstruction in a latent embedding space improves QA performance.

Abstract

Question-answering (QA) on video contents is a significant challenge for achieving human-level intelligence as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large amount of cartoon videos. We develop a video-story learning model, i.e. Deep Embedded Memory Networks (DEMN), to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data. The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset of children's cartoon video series, Pororo. The dataset contains 16,066 scene-dialogue pairs of 20.5-hour videos,…
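The recall step described in the abstract (store stories in memory, then use the question to retrieve the most relevant story and choose an answer) can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for DEMN's learned latent embedding, plain cosine similarity stands in for the LSTM-based attention model, and the function names and example sentences are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(question, memory, candidates):
    """Recall the stored story closest to the question, then pick the
    candidate answer closest to the question plus that story."""
    q = embed(question)
    # Recall step: the memory entry most similar to the question.
    story = max(memory, key=lambda s: cosine(q, embed(s)))
    # Answer selection: score each candidate against question + story.
    context = embed(question + " " + story)
    best = max(candidates, key=lambda c: cosine(context, embed(c)))
    return story, best


# Hypothetical story memory and question, invented for illustration.
memory = [
    "pororo and crong play with a red ball in the snow",
    "loopy bakes cookies for her friends at home",
]
story, best = answer(
    "what does loopy bake",
    memory,
    ["a red ball", "cookies", "a snowman"],
)
print(story)
print(best)
```

In this sketch the question shares the word "loopy" with the second story, so that story is recalled, and "cookies" is the only candidate that overlaps with the combined question and story. A learned embedding and attention model would replace both similarity steps in a real system.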

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning