Holistic Multi-modal Memory Network for Movie Question Answering
Anran Wang, Anh Tuan Luu, Chuan-Sheng Foo, Hongyuan Zhu, Yi Tay, Vijay Chandrasekhar

TL;DR
This paper introduces the Holistic Multi-modal Memory Network (HMMN), a novel framework that fully integrates multi-modal context, questions, and answer choices for improved movie question answering accuracy.
Contribution
The paper proposes a comprehensive multi-hop attention framework that considers all data sources simultaneously, enhancing multi-modal reasoning in question answering tasks.
Findings
Achieves state-of-the-art accuracy on the MovieQA dataset.
Demonstrates the effectiveness of holistic reasoning and attention strategies.
Shows significant improvements over partial interaction models.
Abstract
Answering questions according to multi-modal context is a challenging problem as it requires a deep integration of different data sources. Existing approaches only employ partial interactions among data sources in one attention hop. In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework which fully considers the interactions between different input sources (multi-modal context, question) in each hop. In addition, it takes answer choices into consideration during the context retrieval stage. Therefore, the proposed framework effectively integrates multi-modal context, question, and answer information, which leads to more informative context retrieved for question answering. Our HMMN framework achieves state-of-the-art accuracy on MovieQA dataset. Extensive ablation studies show the importance of holistic reasoning and contributions of different attention…
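The multi-hop retrieval idea described in the abstract can be illustrated with a small sketch. This is not the authors' HMMN formulation: it is a generic memory-network-style attention loop in plain Python, and the function names (`attend`, `multi_hop_answer`), the additive query update, the dot-product scoring, and the choice to seed the query with `question + answer` are all assumptions made for illustration. It shows only the general pattern: a query attends over context slots for several hops, and each answer choice influences which context is retrieved.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, memory):
    # One attention hop: weight each memory slot by its similarity to the query,
    # then return the weighted sum of slots together with the weights.
    weights = softmax([dot(query, slot) for slot in memory])
    retrieved = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(query))
    ]
    return retrieved, weights

def multi_hop_answer(memory, question, answers, hops=2):
    # Hypothetical answer-aware retrieval (an assumption, not the paper's method):
    # for each candidate, seed the query with question + answer, run several hops,
    # and score the candidate against the final query.
    scores = []
    for answer in answers:
        query = [q + a for q, a in zip(question, answer)]
        for _ in range(hops):
            retrieved, _ = attend(query, memory)
            query = [q + r for q, r in zip(query, retrieved)]
        scores.append(dot(query, answer))
    return scores.index(max(scores)), scores

# Toy data: three context slots (e.g. subtitle or frame features), one question,
# and two candidate answers, all as 3-dimensional vectors.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
question = [2.0, 0.0, 0.0]
answers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

best, scores = multi_hop_answer(memory, question, answers)
print("chosen answer index:", best)
```

In this toy setup the first candidate aligns with both the question and the first context slot, so it receives the higher score. A real system would use learned embeddings for each modality rather than hand-written vectors.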
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Memory Network
