Music Audio-Visual Question Answering Requires Specialized Multimodal Designs

Wenhao You; Xingjian Diao; Wenjun Huang; Chunhui Zhang; Keyi Kong; Weiyi Wu; Chiyu Ma; Zhongyu Ouyang; Tingxuan Wu; Ming Cheng; Soroush Vosoughi; Jiang Gui

arXiv:2505.20638·cs.SD·April 13, 2026

TL;DR

This paper analyzes the unique challenges of Music Audio-Visual Question Answering, emphasizing the need for specialized multimodal architectures and providing insights and design patterns for future research.

Contribution

The paper systematically examines Music AVQA datasets and methods, highlighting the importance of domain-specific input processing and of architectures tailored to musical content.

Findings

01. Specialized input processing is crucial for Music AVQA.
02. Dedicated spatial-temporal architectures improve performance.
03. Incorporating musical priors can enhance multimodal understanding.

Abstract

While recent Multimodal Large Language Models exhibit impressive capabilities for general multimodal tasks, specialized domains like music necessitate tailored approaches. Music Audio-Visual Question Answering (Music AVQA) particularly underscores this, presenting unique challenges with its continuous, densely layered audio-visual content, intricate temporal dynamics, and the critical need for domain-specific knowledge. Through a systematic analysis of Music AVQA datasets and methods, this paper identifies that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain. Our study provides valuable insights for researchers by highlighting effective design patterns empirically linked to strong performance, proposing concrete future directions for incorporating musical priors,…

Code & Models

Repositories

WenhaoYou1/Survey4MusicAVQA (GitHub)
