Patch-level Sounding Object Tracking for Audio-Visual Question Answering
Zhangbin Li, Jinxing Zhou, Jing Zhang, Shengeng Tang, Kun Li, Dan Guo

TL;DR
This paper introduces a novel patch-level tracking method for audio-visual question answering, combining motion and sound cues to improve the identification of relevant objects over time.
Contribution
It proposes a new multi-module tracking framework that integrates motion-driven, sound-driven, and question-driven modules for better object tracking in AVQA tasks.
Findings
Achieves competitive performance on standard datasets.
Effectively tracks sounding objects using combined cues.
Outperforms some large-scale pretraining approaches.
Abstract
Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movements that are more likely to relate to sounding objects and questions. We measure the patch-wise motion intensity map between neighboring video frames and utilize it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, with the adjacency matrix regularized by the audio-visual…
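The patch-wise motion intensity map described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes grayscale frames, uses mean absolute frame difference as the motion measure, and weights cosine-similarity edges by the motion of the target patch. The function names `patch_motion_intensity` and `motion_guided_adjacency` are hypothetical.

```python
import numpy as np


def patch_motion_intensity(frames: np.ndarray, patch: int) -> np.ndarray:
    """Mean absolute difference between neighboring frames, pooled per patch.

    frames: (T, H, W) grayscale video (assumption: a single channel).
    Returns an array of shape (T - 1, H // patch, W // patch).
    """
    t, h, w = frames.shape
    gh, gw = h // patch, w // patch
    # Absolute change between each pair of neighboring frames.
    diff = np.abs(np.diff(frames.astype(np.float64), axis=0))
    # Crop to a whole number of patches, then average inside each patch.
    diff = diff[:, : gh * patch, : gw * patch]
    diff = diff.reshape(t - 1, gh, patch, gw, patch)
    return diff.mean(axis=(2, 4))


def motion_guided_adjacency(feats: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Row-normalized adjacency over patches, biased toward moving patches.

    feats: (N, D) patch features; motion: (N,) motion intensities.
    Assumption: edge score = cosine similarity * motion of the target patch.
    """
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    scores = (normed @ normed.T) * motion[None, :]
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)
```

Given a `(T, H, W)` clip, `patch_motion_intensity(clip, 16)[0].ravel()` yields one intensity per patch for the first frame pair, which can be passed as `motion` alongside per-patch features to obtain a graph adjacency matrix.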
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Video Surveillance and Tracking Methods
