Triple Attention Network architecture for MovieQA

Ankit Shah; Tzu-Hsiang Lin; Shijie Wu

arXiv:2111.09531 · cs.MM · November 19, 2021

TL;DR

This paper introduces a triple-attention network architecture for MovieQA that incorporates audio alongside video and text, improving answer prediction accuracy by leveraging complementary multimedia information.

Contribution

It presents a novel triple-attention network that effectively integrates audio into MovieQA, enhancing performance over traditional visual-text models.

Findings

01

Inclusion of audio improves MovieQA accuracy by about 7%.

02

Triple-attention architecture captures complementary information from audio, video, and text.

03

Experiments demonstrate the effectiveness of the proposed approach.

Abstract

Movie question answering, or MovieQA, is a multimedia task in which one is given a video, its subtitles, a question, and candidate answers. The task is to predict the correct answer using the components of the multimedia: video/images, audio, and text. Traditionally, MovieQA uses only the image and text components. In this paper, we propose a novel network with a triple-attention architecture for including audio in the MovieQA task. This architecture is modeled on a traditional dual-attention network that focuses only on video and text. Experiments show that including audio through the triple-attention network provides complementary information for the MovieQA task that is not captured by the visual or textual components of the data. Experiments with a wide range of audio features show that using…
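To make the idea of attending over three modalities concrete, here is a minimal sketch in plain Python. It is not the authors' model: the abstract does not specify the scoring function, the fusion step, or the feature dimensions, so the dot-product attention, the element-wise sum used for fusion, and every name below (`attend`, `triple_attention_answer`, the toy vectors) are assumptions chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Attention-weighted sum of feature vectors.

    Each feature vector is scored by its dot product with the query
    (an assumed scoring function, not taken from the paper).
    """
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def triple_attention_answer(question, video, audio, text, candidates):
    """Pick a candidate answer using one attention pass per modality."""
    # Attend over video, audio, and text separately, each conditioned on the question.
    attended = [attend(question, modality) for modality in (video, audio, text)]
    # Hypothetical fusion: element-wise sum of the question and the three attended vectors.
    fused = [sum(vals) for vals in zip(question, *attended)]
    # Score each candidate answer against the fused representation.
    scores = [dot(fused, c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, softmax(scores)


if __name__ == "__main__":
    # Toy 2-dimensional features; real systems would use learned embeddings.
    question = [1.0, 0.0]
    video = [[1.0, 0.0], [0.0, 1.0]]
    audio = [[0.5, 0.5]]
    text = [[1.0, 0.0]]
    candidates = [[1.0, 0.0], [0.0, 1.0]]
    best, probs = triple_attention_answer(question, video, audio, text, candidates)
    print(best, probs)
```

Dropping the audio entry from the tuple in `triple_attention_answer` gives the dual-attention (video and text) baseline that the abstract contrasts against, which is one way to see where the third modality enters the computation.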

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques