Learning Sparsity for Effective and Efficient Music Performance Question Answering
Xingjian Diao, Tianzhen Yang, Chunhui Zhang, Weiyi Wu, Ming Cheng, Jiang Gui

TL;DR
This paper introduces Sparsify, a sparse learning framework for Music AVQA that integrates several sparsification strategies into one pipeline, improving accuracy, reducing training time, and identifying key subsets of the training data.
Contribution
The paper presents a novel sparsification framework for Music AVQA that achieves state-of-the-art results while reducing training time by 28.32% compared to dense models.
Findings
Achieves state-of-the-art performance on Music AVQA datasets.
Reduces training time by 28.32% compared to dense models.
Uses only 25% of training data while retaining 70-80% of full-data accuracy.
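The 25% figure above refers to training on a selected subset of the data. As a rough illustration only, and not the paper's selection algorithm, the sketch below keeps the top 25% of samples ranked by a hypothetical per-sample importance score. The function name `select_key_subset`, the random scores, and the ranking criterion are all assumptions made for this example.

```python
import random


def select_key_subset(scores, fraction=0.25):
    """Return sorted indices of the highest-scoring samples.

    `scores` is a hypothetical per-sample importance value; how such a
    score would be computed is not specified here and is an assumption.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    # Keep at least one sample so a tiny dataset never yields an empty subset.
    keep = max(1, int(len(scores) * fraction))
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy data: 1000 samples with random placeholder scores.
rng = random.Random(0)
scores = [rng.random() for _ in range(1000)]
subset = select_key_subset(scores, fraction=0.25)
print(len(subset))
```

A model would then be trained on `subset` alone. Whether accuracy holds up depends entirely on how informative the score is, which this sketch does not address.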
Abstract
Music performances, characterized by dense and continuous audio as well as seamless audio-visual integration, present unique challenges for multimodal scene understanding and reasoning. Recent Music Performance Audio-Visual Question Answering (Music AVQA) datasets have been proposed to reflect these challenges, highlighting the continued need for more effective integration of audio-visual representations in complex question answering. However, existing Music AVQA methods often rely on dense and unoptimized representations, leading to inefficiencies in the isolation of key information, the reduction of redundancy, and the prioritization of critical samples. To address these challenges, we introduce Sparsify, a sparse learning framework specifically designed for Music AVQA. It integrates three sparsification strategies into an end-to-end pipeline and achieves state-of-the-art performance…
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception
