Triple Attention Network architecture for MovieQA
Ankit Shah, Tzu-Hsiang Lin, Shijie Wu

TL;DR
This paper introduces a triple-attention network architecture for MovieQA that incorporates audio alongside video and text, improving answer prediction accuracy by leveraging complementary multimedia information.
Contribution
It presents a novel triple-attention network that effectively integrates audio into MovieQA, enhancing performance over traditional visual-text models.
Findings
Inclusion of audio improves MovieQA accuracy by about 7%.
Triple-attention architecture captures complementary information from audio, video, and text.
Experiments demonstrate the effectiveness of the proposed approach.
Abstract
Movie question answering, or MovieQA, is a multimedia-related task in which one is provided with a video, its subtitle information, a question, and candidate answers. The task is to predict the correct answer to the question using the components of the multimedia, namely video/images, audio, and text. Traditionally, MovieQA is done using only the image and text components. In this paper, we propose a novel network with a triple-attention architecture for the inclusion of audio in the MovieQA task. This architecture is fashioned after a traditional dual-attention network focused only on video and text. Experiments show that the inclusion of audio using the triple-attention network provides complementary information for the MovieQA task that is not captured by the visual or textual components of the data. Experiments with a wide range of audio features show that using…
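The abstract does not spell out how the three attention branches are combined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes plain dot-product attention guided by a question vector and an additive fusion of the video, audio, and text contexts; the function names (`attend`, `triple_attention_scores`) and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(query, features):
    """Dot-product attention of one query over a list of feature vectors.

    Returns the attention weights and the weighted sum (context vector).
    """
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, context


def triple_attention_scores(question, video, audio, text, answers):
    """Score candidate answers using three question-guided attention branches.

    Assumption for illustration: the three contexts are summed with the
    question vector. The paper's actual fusion step may differ.
    """
    contexts = [attend(question, modality)[1] for modality in (video, audio, text)]
    fused = [q + sum(c[i] for c in contexts) for i, q in enumerate(question)]
    return softmax([dot(fused, a) for a in answers])


if __name__ == "__main__":
    # Toy 2-dimensional features; real systems would use learned embeddings.
    question = [1.0, 0.0]
    video = [[0.9, 0.1], [0.0, 1.0]]
    audio = [[0.8, 0.2], [0.1, 0.9]]
    text = [[1.0, 0.0], [0.2, 0.8]]
    answers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(triple_attention_scores(question, video, audio, text, answers))
```

Each modality gets its own attention distribution, so in this sketch the audio branch can emphasize segments that the video and text branches weight differently, which is one way to read the claim that audio carries complementary information.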
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
