Adversarial Multimodal Network for Movie Question Answering
Zhaoquan Yuan, Siyuan Sun, Lixin Duan, Xiao Wu, Changsheng Xu

TL;DR
This paper introduces an Adversarial Multimodal Network (AMN) that enhances video story understanding for question answering by learning coherent multimodal features through adversarial training and self-attention mechanisms, outperforming existing methods.
Contribution
The paper proposes a novel AMN model that uses adversarial learning and self-attention to improve multimodal feature representation for video question answering.
Findings
AMN outperforms state-of-the-art methods on the MovieQA dataset.
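Self-attention enforces consistency constraints that preserve the self-correlation of visual cues from the original video clips in the learned multimodal representations.
Adversarial training yields a more coherent shared subspace for video clips and their corresponding texts (subtitles and questions).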
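
The consistency finding can be illustrated with a minimal sketch. It is not the authors' formulation: it assumes the self-correlation of a clip is captured by a scaled dot-product self-attention map over its frame features, and that consistency is measured as the mean squared difference between the map of the original visual features and the map of the learned multimodal features. The function names (`self_attention_map`, `consistency_loss`) and the toy feature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_map(features):
    """Row-wise softmax of scaled dot products between every pair of rows.

    `features` is a list of n feature vectors; the result is an n x n map
    describing how strongly each item correlates with every other item.
    """
    scale = math.sqrt(len(features[0]))
    scores = [
        [sum(a * b for a, b in zip(u, v)) / scale for v in features]
        for u in features
    ]
    return [softmax(row) for row in scores]


def consistency_loss(visual, multimodal):
    """Assumed penalty: mean squared gap between the two attention maps.

    The two inputs may have different feature dimensions, because only
    their n x n self-attention maps are compared.
    """
    original = self_attention_map(visual)
    learned = self_attention_map(multimodal)
    n = len(original)
    return sum(
        (original[i][j] - learned[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    ) / (n * n)


# Toy frame features: frames 0 and 1 are similar, frame 2 is different.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# A hypothetical learned representation that scrambles that structure.
scrambled = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

loss_same = consistency_loss(visual, visual)
loss_scrambled = consistency_loss(visual, scrambled)
```

Under these assumptions, a representation that keeps the pairwise structure of the frames incurs no penalty, while one that rearranges it is penalized, which is the behaviour a consistency constraint is meant to encourage.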
Abstract
Visual question answering by using information from multiple modalities has attracted more and more attention in recent years. However, it is a very challenging task, as the visual content and natural language have quite different statistical properties. In this work, we present a method called Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, as inspired by generative adversarial networks, we propose to learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions). Moreover, we introduce a self-attention mechanism to enforce the so-called consistency constraints in order to preserve the self-correlation of visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the MovieQA dataset show the…
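
The GAN-inspired idea of finding a coherent subspace for video and text can be sketched as a two-player objective. The sketch below is a generic illustration, not the paper's loss: it assumes a hypothetical linear discriminator that tries to tell projected video features from projected text features, and an encoder objective that is lowest when text features are mistaken for video features. All names, labels, and toy values are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminate(feature, weights, bias):
    """Hypothetical linear discriminator: P(feature came from video)."""
    return sigmoid(sum(w * f for w, f in zip(weights, feature)) + bias)


def discriminator_loss(video_feats, text_feats, weights, bias):
    """Binary cross-entropy: video is labelled 1, text is labelled 0."""
    video_term = -sum(
        math.log(discriminate(v, weights, bias)) for v in video_feats
    ) / len(video_feats)
    text_term = -sum(
        math.log(1.0 - discriminate(t, weights, bias)) for t in text_feats
    ) / len(text_feats)
    return video_term + text_term


def encoder_loss(text_feats, weights, bias):
    """Assumed adversarial term: small when text is mistaken for video."""
    return -sum(
        math.log(discriminate(t, weights, bias)) for t in text_feats
    ) / len(text_feats)


# Toy 1-D features in a shared subspace where the modalities are far apart.
video = [[2.0], [3.0]]
text = [[-2.0], [-3.0]]

# A discriminator that separates them well has a low loss ...
separated = discriminator_loss(video, text, [1.0], 0.0)
# ... while one that outputs 0.5 everywhere sits at chance level, 2 * ln(2).
chance = discriminator_loss(video, text, [0.0], 0.0)
```

In this toy setting, pushing the two modalities toward indistinguishable features drives the discriminator toward the chance-level loss, which is one way to read "a more coherent subspace" under the stated assumptions.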
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
