MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization
Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, Xie Chen

TL;DR
MuQ introduces a self-supervised music representation learning model utilizing Mel Residual Vector Quantization, outperforming previous models on various music understanding tasks with less data and demonstrating strong zero-shot capabilities.
Contribution
The paper proposes MuQ, a novel self-supervised music representation model using Mel-RVQ for improved stability and efficiency, and introduces MuQ-MuLan for state-of-the-art zero-shot music tagging.
Findings
MuQ outperforms previous models on multiple downstream tasks.
Scaling data and iterative training further improve performance.
MuQ-MuLan achieves state-of-the-art zero-shot music tagging results.
Abstract
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of…
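As a rough illustration of the residual quantization idea named in the abstract, the following NumPy sketch maps Mel frames to one token per residual step. It is not the authors' implementation: the function name `mel_rvq_tokens`, the per-step encoder and decoder matrices, the array sizes, and the random (untrained) weights are all assumptions made for illustration.

```python
import numpy as np


def mel_rvq_tokens(frames, encoders, codebooks, decoders):
    """Sketch of residual vector quantization with linear projections.

    All names and shapes are hypothetical, not taken from the paper:
      frames:    (T, D) array of T Mel-spectrogram frames with D bins
      encoders:  list of (D, K) matrices, one linear projection per step
      codebooks: list of (V, K) arrays, one codebook per step
      decoders:  list of (K, D) matrices mapping a code back to Mel space
    Returns (T, N) token indices and the (T, D) residual left after N steps.
    """
    residual = np.asarray(frames, dtype=np.float64)
    tokens = []
    for w_in, codebook, w_out in zip(encoders, codebooks, decoders):
        z = residual @ w_in  # project the current residual: (T, K)
        # Squared Euclidean distance from every frame to every code: (T, V)
        dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)  # nearest code per frame
        tokens.append(idx)
        # Subtract the decoded code so the next step quantizes what is left
        residual = residual - codebook[idx] @ w_out
    return np.stack(tokens, axis=1), residual


# Toy usage with random, untrained weights (sizes are arbitrary assumptions)
rng = np.random.default_rng(0)
T, D, K, V, N = 5, 128, 16, 32, 4
mel = rng.standard_normal((T, D))
enc = [rng.standard_normal((D, K)) for _ in range(N)]
books = [rng.standard_normal((V, K)) for _ in range(N)]
dec = [rng.standard_normal((K, D)) for _ in range(N)]
demo_tokens, demo_residual = mel_rvq_tokens(mel, enc, books, dec)
print(demo_tokens.shape)
```

In a setup like this, the per-frame token indices would serve as the prediction targets for the self-supervised model. With trained projections and codebooks each step should shrink the residual; with the random weights used here it generally does not.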
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
