Multimodal Dual Attention Memory for Video Story Question Answering

Kyung-Min Kim; Seong-Ho Choi; Jin-Hwa Kim; Byoung-Tak Zhang

arXiv:1809.07999 · cs.CV · September 24, 2018 · 6 cites

Open Access

TL;DR

This paper introduces MDAM, a video story question-answering model that applies dual attention followed by late fusion to learn high-level vision-language representations, achieving state-of-the-art results on the PororoQA and MovieQA datasets.

Contribution

The paper presents a dual attention memory architecture with late fusion for video story QA that outperforms the runner-up models on both PororoQA and MovieQA.

Findings

01. MDAM achieves state-of-the-art results on the PororoQA and MovieQA datasets.

02. Dual attention combined with late fusion improves inference accuracy.

03. Qualitative analysis visualizes the model's inference process.

Abstract

We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM first uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM then applies a second attention over these latent concepts. Multimodal fusion is performed after the dual attention processes (late fusion). Using this processing pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on the PororoQA and MovieQA datasets, which have large-scale QA annotations on cartoon videos and movies, respectively. On both datasets, MDAM achieves new state-of-the-art results with significant margins over the runner-up models. We confirm the best performance of the dual attention mechanism combined with late…
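The pipeline described in the abstract (self-attention within each modality, a second attention guided by the question, and fusion only afterwards) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain dot-product attention, the element-wise product used as the fusion step, and the random toy inputs are all assumptions chosen for brevity, since the abstract does not specify the actual layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine vectors with the given weights, one output value per dimension."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def self_attention(sequence):
    """First attention: every time step attends to all steps of its own modality."""
    scale = math.sqrt(len(sequence[0]))
    return [
        weighted_sum(softmax([dot(x, y) / scale for y in sequence]), sequence)
        for x in sequence
    ]


def question_attention(latent, question):
    """Second attention: weight the latent steps by similarity to the question."""
    scale = math.sqrt(len(question))
    weights = softmax([dot(h, question) / scale for h in latent])
    return weighted_sum(weights, latent), weights


def answer_probabilities(frames, captions, question, answers):
    """Late fusion: modalities are combined only after both attention stages."""
    visual, _ = question_attention(self_attention(frames), question)
    textual, _ = question_attention(self_attention(captions), question)
    # Assumption: an element-wise product stands in for the fusion module.
    fused = [v * t for v, t in zip(visual, textual)]
    return softmax([dot(a, fused) for a in answers])


# Toy inputs (hypothetical): 6 time steps, 8-dim features, 5 candidate answers.
rng = random.Random(0)


def rand_vec(dim):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


frames = [rand_vec(8) for _ in range(6)]
captions = [rand_vec(8) for _ in range(6)]
question = rand_vec(8)
answers = [rand_vec(8) for _ in range(5)]

probs = answer_probabilities(frames, captions, question, answers)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Here `probs` is a distribution over the candidate answers and `predicted` is the index of the highest-scoring one. Keeping the two modalities separate until after `question_attention` is what makes this late fusion; an early-fusion variant would instead merge `frames` and `captions` before either attention stage.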

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques