Patch-level Sounding Object Tracking for Audio-Visual Question Answering

Zhangbin Li; Jinxing Zhou; Jing Zhang; Shengeng Tang; Kun Li; Dan Guo

arXiv:2412.10749 · cs.MM · December 17, 2024 · 2 cites

TL;DR

This paper introduces a novel patch-level tracking method for audio-visual question answering, combining motion and sound cues to improve the identification of relevant objects over time.

Contribution

The paper proposes Patch-level Sounding Object Tracking (PSOT), a multi-module framework that integrates motion-driven, sound-driven, and question-driven modules to track question-relevant sounding objects in AVQA.

Findings

01 Achieves competitive performance on standard datasets.

02 Effectively tracks sounding objects using combined cues.

03 Outperforms some large-scale pretraining approaches.

Abstract

Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movements that are more likely to relate to sounding objects and questions. We measure the patch-wise motion intensity map between neighboring video frames and utilize it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, with the adjacency matrix regularized by the audio-visual…
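
As a rough illustration of the patch-wise motion intensity map described in the abstract, the sketch below scores each patch by the mean absolute pixel change between two neighboring grayscale frames, then links every patch to the highest-motion patches. Both choices are assumptions made for illustration: the abstract does not state how motion intensity is measured or how the graph is built, and the names `patch_motion_intensity`, `motion_adjacency`, and `top_k` are hypothetical rather than taken from the paper.

```python
def patch_motion_intensity(prev_frame, next_frame, patch_size):
    """Mean absolute pixel difference per patch between two grayscale frames.

    Assumption: frames are equally sized lists of rows of numbers, and the
    height and width are multiples of patch_size.
    """
    height, width = len(prev_frame), len(prev_frame[0])
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be a multiple of patch_size")
    rows, cols = height // patch_size, width // patch_size
    totals = [[0.0] * cols for _ in range(rows)]
    for y in range(height):
        for x in range(width):
            diff = abs(next_frame[y][x] - prev_frame[y][x])
            totals[y // patch_size][x // patch_size] += diff
    area = patch_size * patch_size
    return [[value / area for value in row] for row in totals]


def motion_adjacency(intensity, top_k):
    """Hypothetical graph rule: connect every patch to the top_k patches
    with the largest motion intensity (plus a self-loop)."""
    flat = [value for row in intensity for value in row]
    n = len(flat)
    key_patches = sorted(range(n), key=lambda i: flat[i], reverse=True)[:top_k]
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        adjacency[i][i] = 1
        for j in key_patches:
            adjacency[i][j] = 1
            adjacency[j][i] = 1
    return adjacency
```

In this sketch, a 4x4 frame pair split into 2x2 patches yields a 2x2 intensity map and a 4x4 adjacency matrix over the flattened patches. A learned model would more likely use the intensity map as a soft weight on graph edges than as a hard top-k rule; the hard rule is used here only to keep the example short.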

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Surveillance and Tracking Methods