Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios

Yuanyuan Jiang; Jianqin Yin

arXiv:2305.12397·cs.CV·December 11, 2023·1 cites

TL;DR

This paper introduces a novel target-aware joint spatio-temporal reasoning network for audio-visual question answering, effectively integrating spatial and temporal grounding with a focus on query-relevant cues and cross-modal synchronization.

Contribution

It proposes a unified target-aware spatial grounding module and a single-stream joint temporal grounding module with cross-modal synchrony loss, advancing AVQA performance.

Findings

01 Outperforms existing state-of-the-art methods

02 Effective spatial and temporal grounding in AVQA

03 Improved cross-modal synchronization

Abstract

Audio-visual question answering (AVQA) is a challenging task that requires multistep spatio-temporal reasoning over multimodal contexts. Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding while mistreating audio and video as separate entities for temporal grounding. This paper proposes a new target-aware joint spatio-temporal grounding network for AVQA. It consists of two key components: the target-aware spatial grounding module (TSG) and the single-stream joint audio-visual temporal grounding module (JTG). The TSG can focus on audio-visual cues relevant to the query subject by utilizing explicit semantics from the question. Unlike previous two-stream temporal grounding modules that required an additional audio-visual fusion module, JTG incorporates audio-visual fusion and question-aware temporal grounding into one module with a simpler…
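
The abstract says the TSG focuses on audio-visual cues relevant to the query subject by using explicit semantics from the question. One common way to realize that kind of focus is scaled dot-product attention, where a question-derived target vector scores each visual patch. The sketch below illustrates that general idea only and is not the paper's implementation; the names `target_aware_attention`, `target_vec`, and `patch_feats`, and the toy feature values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def target_aware_attention(target_vec, patch_feats):
    """Weight visual patches by similarity to a question-derived target vector.

    Assumption: generic scaled dot-product attention stands in for the
    target-aware grounding step described in the abstract.
    """
    d = len(target_vec)
    # Score each patch against the target, scaled by sqrt(feature dim).
    scores = [dot(target_vec, p) / math.sqrt(d) for p in patch_feats]
    weights = softmax(scores)
    # Attended feature: weighted sum of the patch features.
    attended = [
        sum(w * p[i] for w, p in zip(weights, patch_feats)) for i in range(d)
    ]
    return weights, attended


# Toy example: the first patch is aligned with the target direction.
target = [1.0, 0.0]
patches = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
weights, attended = target_aware_attention(target, patches)
```

In this toy setup the patch that points in the same direction as the target vector receives the largest weight, which is the behavior "target-aware" grounding is meant to capture.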

Code & Models

Repositories

Bravo5542/TJSTG
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Subtitles and Audiovisual Media