Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering
Kun Li, Michael Ying Yang, Sami Sebastian Brandt

TL;DR
This paper introduces a novel query-guided spatial-temporal-frequency interaction method for audio-visual question answering, leveraging frequency domain features and context reasoning to improve multimodal understanding.
Contribution
It proposes QSTar, a query-guided spatial-temporal-frequency interaction approach that incorporates question guidance and frequency-domain audio features, and reports AVQA performance beyond existing methods.
Findings
Significant performance improvements on AVQA benchmarks.
Effective utilization of frequency-domain audio features.
Enhanced focus on relevant audio-visual cues through context reasoning.
Abstract
Audio-Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio-visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial-Temporal-Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain…
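To make the two ideas named in the abstract concrete (frequency-domain audio features, and question guidance applied before the final reasoning stage), here is a minimal NumPy sketch. It is not the QSTar architecture, whose details are not given in the text above. The function names `magnitude_spectrogram` and `query_guided_pool`, the frame sizes, and the random "question embedding" are all hypothetical choices made for illustration only.

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=256, hop=128):
    """Short-time Fourier magnitudes: one row per frame, one column per frequency bin."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len // 2 + 1)

def query_guided_pool(features, query):
    """Scaled dot-product attention of one query vector over the rows of `features`."""
    scores = features @ query / np.sqrt(features.shape[1])
    scores = scores - scores.max()              # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights, weights @ features          # per-frame weights, pooled feature

# Toy input (assumption): a 1-second 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440 * t)

spec = magnitude_spectrogram(wave)

# Stand-in for a question embedding (assumption): a random vector with one
# entry per frequency bin. A real system would use a learned text encoder.
rng = np.random.default_rng(0)
query = rng.standard_normal(spec.shape[1])

weights, pooled = query_guided_pool(spec, query)
print(spec.shape, weights.shape, pooled.shape)
```

The point of the sketch is the ordering: the query shapes which audio frames are emphasised while features are being pooled, rather than being combined with already-pooled features at the end.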
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech and Audio Processing
