Multimodal Dual Attention Memory for Video Story Question Answering
Kyung-Min Kim, Seong-Ho Choi, Jin-Hwa Kim, Byoung-Tak Zhang

TL;DR
This paper introduces MDAM, a video story question-answering model that applies dual attention followed by late fusion to learn high-level vision-language representations, achieving state-of-the-art results on the PororoQA and MovieQA datasets.
Contribution
The paper presents a dual attention memory architecture with late fusion for video story QA that outperforms the runner-up models on PororoQA and MovieQA.
Findings
MDAM achieves state-of-the-art results on the PororoQA and MovieQA datasets.
Dual attention combined with late fusion improves inference accuracy.
Qualitative analysis visualizes the model's inference process.
Abstract
We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM applies a second attention over these latent concepts. Multimodal fusion is performed after the dual attention processes (late fusion). Using this processing pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on the PororoQA and MovieQA datasets, which have large-scale QA annotations on cartoon videos and movies, respectively. For both datasets, MDAM achieves new state-of-the-art results with significant margins compared to the runner-up models. We confirm the best performance of the dual attention mechanism combined with late…
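The pipeline described in the abstract (self-attention per modality, then question-guided attention, then fusion) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the single-head scaled dot-product attention, the element-wise product used as the fusion step, the dot-product answer scoring, and all function names are assumptions chosen only to show the order of operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, rows):
    """Combine row vectors using one weight per row."""
    dim = len(rows[0])
    return [sum(w * row[i] for w, row in zip(weights, rows)) for i in range(dim)]

def self_attention(rows):
    """First attention: each row attends over all rows of the same modality."""
    scale = math.sqrt(len(rows[0]))
    out = []
    for query in rows:
        weights = softmax([dot(query, key) / scale for key in rows])
        out.append(weighted_sum(weights, rows))
    return out

def question_attention(rows, question):
    """Second attention: the question vector selects among the latent rows."""
    scale = math.sqrt(len(question))
    weights = softmax([dot(row, question) / scale for row in rows])
    return weighted_sum(weights, rows), weights

def answer_probabilities(frames, captions, question, answers):
    """Late fusion: modalities are combined only after both attention steps."""
    frame_summary, _ = question_attention(self_attention(frames), question)
    caption_summary, _ = question_attention(self_attention(captions), question)
    # Assumed fusion operator (element-wise product); a placeholder only.
    fused = [f * c for f, c in zip(frame_summary, caption_summary)]
    # Assumed scoring: dot product between the fused vector and each answer.
    return softmax([dot(answer, fused) for answer in answers])
```

Here `frames` and `captions` are lists of equal-dimension feature vectors, `question` is a single vector, and `answers` is a list of candidate-answer vectors. The point of the sketch is the ordering: the two modalities never interact until each has been summarized by both attention steps, which is what distinguishes late fusion from fusing frame and caption features up front.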
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
