Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents
Bo Wang, Youjiang Xu, Yahong Han, Richang Hong

TL;DR
This paper introduces a Layered Memory Network (LMN) that effectively combines visual content and subtitles to improve movie question answering, achieving state-of-the-art results on the MovieQA dataset.
Contribution
The paper proposes a novel hierarchical memory network that encodes frame-level and clip-level movie content using visual data and subtitles, enhancing movie understanding for question answering.
Findings
Even with visual content alone, the frame-level representation learned by LMN improves question-answering performance.
Incorporating subtitles into LMN achieves state-of-the-art results.
Hierarchical movie representations show strong potential for movie QA applications.
Abstract
Movies provide a wealth of visual content as well as engaging stories. Existing methods have shown that understanding movie stories from visual content alone remains a hard problem. In this paper, to answer questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content with the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. Specifically, we first extract words and sentences from the training movie subtitles. The hierarchically formed movie representations learned by LMN then encode not only the correspondence between words and visual content inside frames, but also the temporal alignment between sentences and frames inside movie clips. We also extend our LMN model into three variant frameworks to illustrate its extensibility. We…
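The two-level idea in the abstract can be illustrated with a generic soft-attention memory read. The sketch below is not the authors' formulation: the helper names (`softmax`, `dot`, `memory_read`), the toy 2-dimensional vectors, and the mean pooling over frames are all assumptions made for illustration. It only shows how a frame feature might be re-expressed through a word memory, and how the resulting frame-level vectors might then query a subtitle-sentence memory to form a clip-level vector.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def memory_read(query, memory):
    # Generic soft-attention read (an assumption, not the paper's equations):
    # weight each memory slot by its similarity to the query, then sum.
    weights = softmax([dot(query, slot) for slot in memory])
    dim = len(memory[0])
    return [sum(w * slot[d] for w, slot in zip(weights, memory))
            for d in range(dim)]

# Hypothetical toy data: two word embeddings, two frame features,
# two subtitle-sentence embeddings, all in a shared 2-d space.
word_memory = [[1.0, 0.0], [0.0, 1.0]]
frames = [[5.0, 0.0], [0.0, 5.0]]
subtitle_memory = [[1.0, 1.0], [-1.0, 0.5]]

# Frame level: each frame is re-expressed as a mixture of word embeddings.
frame_reprs = [memory_read(f, word_memory) for f in frames]

# Clip level: each frame-level vector queries the subtitle memory;
# the reads are averaged over frames (mean pooling is an assumption).
reads = [memory_read(fr, subtitle_memory) for fr in frame_reprs]
clip_repr = [sum(r[d] for r in reads) / len(reads) for d in range(2)]
```

In this toy setup the first frame aligns almost entirely with the first word slot, so its frame-level vector is dominated by that word's embedding. A real system would learn the embeddings and projections rather than fix them by hand.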
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Memory Network
