Learning Sparsity for Effective and Efficient Music Performance Question Answering

Xingjian Diao; Tianzhen Yang; Chunhui Zhang; Weiyi Wu; Ming Cheng; Jiang Gui

arXiv:2506.01319·cs.SD·June 3, 2025

TL;DR

This paper introduces Sparsify, a sparse learning framework for Music AVQA that integrates three sparsification strategies into an end-to-end pipeline, reducing training time relative to dense models and selecting a key subset of the training data that retains most of the full-data accuracy.

Contribution

The paper presents a sparsification framework for Music AVQA that achieves state-of-the-art results while training 28.32% faster than dense models.

Findings

01

Achieves state-of-the-art performance on Music AVQA datasets.

02

Reduces training time by 28.32% compared to dense models.

03

Uses only 25% of training data while retaining 70-80% of full-data accuracy.
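As a rough illustration of what keeping 25% of the training data can look like, here is a minimal sketch of a top-fraction selection step. It is not the paper's key-subset selection algorithm, which this page does not describe. The function name `select_key_subset` and the per-sample importance `scores` are hypothetical, assumed only for this example.

```python
import math


def select_key_subset(scores, fraction=0.25):
    """Return the indices of the highest-scoring samples.

    `scores` is a hypothetical per-sample importance score (an assumption
    for this sketch, not a quantity defined on this page). `fraction` is
    the share of samples to keep, e.g. 0.25 for a 25% subset.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    # Round up so that a non-empty dataset always keeps at least one sample.
    keep = max(1, math.ceil(len(scores) * fraction))
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in their original dataset order.
    return sorted(ranked[:keep])


scores = [0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6]
subset = select_key_subset(scores, fraction=0.25)
print(subset)  # prints [0, 3]
```

With eight samples and `fraction=0.25`, the sketch keeps the two highest-scoring samples. How the scores are computed is the substantive part of any such method and is left open here.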

Abstract

Music performances, characterized by dense and continuous audio as well as seamless audio-visual integration, present unique challenges for multimodal scene understanding and reasoning. Recent Music Performance Audio-Visual Question Answering (Music AVQA) datasets have been proposed to reflect these challenges, highlighting the continued need for more effective integration of audio-visual representations in complex question answering. However, existing Music AVQA methods often rely on dense and unoptimized representations, leading to inefficiencies in the isolation of key information, the reduction of redundancy, and the prioritization of critical samples. To address these challenges, we introduce Sparsify, a sparse learning framework specifically designed for Music AVQA. It integrates three sparsification strategies into an end-to-end pipeline and achieves state-of-the-art performance…

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception