Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering

Kun Li; Michael Ying Yang; Sami Sebastian Brandt

arXiv:2601.19821·cs.CV·March 10, 2026

Open Access

TL;DR

This paper introduces QSTar, a query-guided spatial-temporal-frequency interaction method for audio-visual question answering (AVQA) that uses frequency-domain audio features and context reasoning to improve multimodal understanding.

Contribution

The paper proposes QSTar, an interaction approach that incorporates question guidance and frequency-domain audio features, improving AVQA performance over existing methods.

Findings

01. Significant performance improvements on AVQA benchmarks.

02. Effective utilization of frequency-domain audio features.

03. Enhanced focus on relevant audio-visual cues through context reasoning.

Abstract

Audio-Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio-visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial-Temporal-Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain…
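
To make the two ideas named in the abstract more concrete (frequency-domain audio features and question guidance applied before the final reasoning stage), here is a minimal NumPy sketch. It is not the authors' QSTar architecture, whose details are not given on this page. It only illustrates a generic pattern: compute a magnitude spectrogram from raw audio, then use a question embedding to attend over its time frames. The function names, the random `query` vector standing in for an encoded question, and the projection matrix `proj` are all assumptions made for illustration.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Short-time Fourier magnitudes, shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def query_guided_pooling(features, query, proj):
    """Pool time frames with attention weights driven by a query vector.

    features: (T, F) spectrogram frames
    query:    (D,)   hypothetical question embedding
    proj:     (F, D) hypothetical projection of frames into the query space
    """
    keys = features @ proj                            # (T, D)
    scores = keys @ query / np.sqrt(query.shape[0])   # (T,)
    weights = softmax(scores)                         # attention over time frames
    pooled = weights @ features                       # (F,) query-weighted spectrum
    return pooled, weights

# Toy input: one second of a 440 Hz tone sampled at 8 kHz.
rng = np.random.default_rng(0)
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)

spec = magnitude_spectrogram(audio)
query = rng.standard_normal(16)                       # stand-in for an encoded question
proj = rng.standard_normal((spec.shape[1], 16)) * 0.01
pooled, weights = query_guided_pooling(spec, query, proj)
```

In a trained model the query and projection would be learned rather than random. The point here is only that the question can shape which audio frames contribute to the pooled frequency representation, instead of being merged in at the very end.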

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech and Audio Processing