Progressive Spatio-temporal Perception for Audio-Visual Question Answering

Guangyao Li; Wenxuan Hou; Di Hu

arXiv:2308.05421·cs.CV·August 11, 2023

TL;DR

This paper introduces PSTP-Net, a progressive network that identifies key spatio-temporal regions in videos for improved audio-visual question answering by focusing on question-relevant content.

Contribution

The paper proposes a novel multi-module network that progressively selects relevant video segments and regions, enhancing AVQA performance by filtering out irrelevant information.

Findings

01 Outperforms existing AVQA models on MUSIC-AVQA and AVQA datasets.

02 Demonstrates improved accuracy and efficiency in question answering.

03 Validates the effectiveness of progressive spatio-temporal perception modules.

Abstract

The Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which may be unrelated to the given question or may even interfere with answering the content of interest. In contrast, focusing only on the question-aware audio-visual content removes this interference and enables the model to answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions w.r.t. questions. Specifically, a temporal segment selection module is first introduced to select the most relevant audio-visual segments related to the given question. Then, a spatial region…
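
The temporal segment selection step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each segment and the question are already encoded as fixed-length vectors, and it substitutes plain cosine similarity for whatever learned relevance scoring PSTP-Net uses. The function names `cosine` and `select_top_k_segments` and the example vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k_segments(
    question_vec: Sequence[float],
    segment_feats: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return the indices of the k segments most similar to the question.

    Assumption: cosine similarity stands in for a learned question-to-segment
    relevance score. Indices are returned in temporal order.
    """
    scores = [cosine(question_vec, seg) for seg in segment_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical 2-D features for one question and four temporal segments.
question = [1.0, 0.0]
segments = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [0.9, 0.0]]
selected = select_top_k_segments(question, segments, k=2)
print(selected)
```

Returning the selected indices in temporal order keeps the surviving segments usable by any later stage that depends on time ordering; only the question-relevant subset is passed on.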

Code & Models

Repositories

gewu-lab/pstp-net (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Advanced Image and Video Retrieval Techniques