Learning to Answer Questions in Dynamic Audio-Visual Scenarios

Guangyao Li; Yake Wei; Yapeng Tian; Chenliang Xu; Ji-Rong Wen; Di Hu

arXiv:2203.14072 · cs.CV · April 6, 2022 · 5 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces MUSIC-AVQA, a large-scale dataset for audio-visual question answering (AVQA), together with a spatio-temporal grounded audio-visual network that outperforms recent audio, visual, and audio-visual question answering approaches.

Contribution

The paper presents the MUSIC-AVQA dataset, with more than 45K question-answer pairs covering 33 question templates, and a spatio-temporal grounded audio-visual network for audio-visual question answering (AVQA).

Findings

01

The proposed model outperforms recent audio question answering, visual question answering, and AVQA approaches.

02

The dataset enables comprehensive evaluation of audio-visual reasoning.

03

Multisensory perception enhances question answering performance.

Abstract

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce the large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent A-, V-, and AVQA approaches. We believe that our built dataset has the potential to serve as a testbed for evaluating and promoting…

Code & Models

Repositories

GeWu-Lab/MUSIC-AVQA
PyTorch · Official

Datasets

MERA-evaluation/ruEnvAQA
dataset · 55 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Video Analysis and Summarization