DeepStory: Video Story QA by Deep Embedded Memory Networks
Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang

TL;DR
This paper introduces Deep Embedded Memory Networks (DEMN), a novel AI model that learns from cartoon videos to perform video story question-answering by reconstructing stories in a latent space and using attention mechanisms.
Contribution
The paper presents DEMN, a model that combines scene-dialogue story reconstruction in a latent embedding space with an LSTM-based attention model for video story QA. It outperforms other QA models on the Pororo dataset and achieves state-of-the-art results on the MovieQA benchmark.
Findings
DEMN outperforms other QA models on the Pororo dataset.
DEMN achieves state-of-the-art results on the MovieQA benchmark.
Story reconstruction in a latent embedding space improves QA performance.
Abstract
Question-answering (QA) on video contents is a significant challenge for achieving human-level intelligence as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large amount of cartoon videos. We develop a video-story learning model, i.e. Deep Embedded Memory Networks (DEMN), to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data. The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset of children's cartoon video series, Pororo. The dataset contains 16,066 scene-dialogue pairs of 20.5-hour videos,…
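The recall step described in the abstract (store stories in memory, then use the question to retrieve a story and pick an answer) can be illustrated with a minimal sketch. This is not the authors' DEMN: the bag-of-words `embed` function stands in for the learned latent embedding, plain cosine similarity stands in for the LSTM-based attention model, and the function names, stories, and candidate answers are all hypothetical examples.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(question, memory, candidates):
    """Recall the most relevant stored story, then pick the best candidate answer."""
    q = embed(question)
    # Step 1: recall the stored story most similar to the question.
    story = max(memory, key=lambda s: cosine(q, embed(s)))
    # Step 2: score each candidate against the question plus the recalled story.
    context = q + embed(story)
    best = max(candidates, key=lambda c: cosine(context, embed(c)))
    return story, best


# Hypothetical long-term memory of story sentences (assumed, not from the dataset).
memory = [
    "pororo and crong play with a red ball in the snow",
    "loopy bakes a cake for her friends",
]
candidates = ["a red ball", "a cake", "a sled"]

story, best = answer("what does loopy bake", memory, candidates)
print(story)
print(best)
```

The two-step structure (retrieve a story, then score answers against question and story together) mirrors the question-story-answer triplet idea only loosely; a real system would learn the embeddings and the scoring function from data.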
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
