Scaling Up Music Information Retrieval Training with Semi-Supervised Learning
Yun-Ning Hung, Ju-Chiang Wang, Minz Won, Duc Le

TL;DR
This paper demonstrates that scaling up both model size and unlabeled training data using semi-supervised learning significantly improves performance across multiple Music Information Retrieval tasks, achieving state-of-the-art results.
Contribution
It is the first to systematically study the combined effects of large-scale data and model size in semi-supervised MIR training.
Findings
Scaling the unlabeled training data to 240k hours improves model performance.
Increasing model size from under 3M to almost 100M parameters improves results.
Large-scale semi-supervised training outperforms supervised and self-supervised methods.
Abstract
In the era of data-driven Music Information Retrieval (MIR), the scarcity of labeled data has been one of the major obstacles to the success of MIR tasks. In this work, we leverage the semi-supervised teacher-student training approach to improve MIR tasks. For training, we scale up the unlabeled music data to 240k hours, which is much larger than any public MIR dataset. We iteratively create and refine the pseudo-labels in the noisy teacher-student training process. Knowledge expansion is also explored to iteratively scale up the model sizes from fewer than 3M to almost 100M parameters. We study the performance correlation between data size and model size in the experiments. By scaling up both model size and training data, our models achieve state-of-the-art results on several MIR tasks compared to models that are either trained in a supervised manner or based on a…
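The iterative teacher-student loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the tiny NumPy MLP, the Gaussian input noise standing in for whatever noise the paper applies to the student, the use of soft pseudo-labels, the 2-D toy data, and all function names (`init_mlp`, `train`, `noisy_student`, `toy_blobs`). The sketch only shows the general pattern: a small teacher trained on labeled data pseudo-labels the unlabeled pool, a larger student is trained with noise on both sets, and the student then becomes the next teacher.

```python
import numpy as np

def init_mlp(n_in, n_hidden, n_out, rng):
    # One-hidden-layer MLP; n_hidden is the "model size" knob in this toy.
    return {"W1": rng.normal(0, 0.5, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.5, (n_hidden, n_out)), "b2": np.zeros(n_out)}

def forward(params, x):
    h = np.tanh(x @ params["W1"] + params["b1"])
    logits = h @ params["W2"] + params["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return h, p / p.sum(axis=1, keepdims=True)

def train(params, x, y_soft, rng, noise_std=0.0, epochs=300, lr=0.1):
    # Full-batch gradient descent on cross-entropy against (soft) targets.
    for _ in range(epochs):
        x_in = x + rng.normal(0, noise_std, x.shape)  # hypothetical student noise
        h, p = forward(params, x_in)
        g_logits = (p - y_soft) / len(x)
        g_h = (g_logits @ params["W2"].T) * (1 - h ** 2)
        params["W2"] -= lr * h.T @ g_logits
        params["b2"] -= lr * g_logits.sum(axis=0)
        params["W1"] -= lr * x_in.T @ g_h
        params["b1"] -= lr * g_h.sum(axis=0)
    return params

def noisy_student(x_lab, y_lab, x_unlab, hidden_sizes, rng):
    n_in, n_cls = x_lab.shape[1], int(y_lab.max()) + 1
    y_onehot = np.eye(n_cls)[y_lab]
    # Initial teacher: smallest model, labeled data only, no noise.
    model = train(init_mlp(n_in, hidden_sizes[0], n_cls, rng), x_lab, y_onehot, rng)
    for n_hidden in hidden_sizes[1:]:
        # The current teacher pseudo-labels the unlabeled pool (clean inputs).
        _, pseudo = forward(model, x_unlab)
        x_all = np.vstack([x_lab, x_unlab])
        y_all = np.vstack([y_onehot, pseudo])
        # A larger student is trained with noise, then replaces the teacher.
        student = init_mlp(n_in, n_hidden, n_cls, rng)
        model = train(student, x_all, y_all, rng, noise_std=0.1)
    return model

def toy_blobs(n_per_class, rng):
    # Two Gaussian blobs as a stand-in for real features and labels.
    x = np.vstack([rng.normal(-2, 1, (n_per_class, 2)),
                   rng.normal(2, 1, (n_per_class, 2))])
    return x, np.repeat([0, 1], n_per_class)

rng = np.random.default_rng(0)
x_lab, y_lab = toy_blobs(10, rng)      # small labeled set
x_unlab, _ = toy_blobs(500, rng)       # larger unlabeled pool
x_test, y_test = toy_blobs(200, rng)
model = noisy_student(x_lab, y_lab, x_unlab, hidden_sizes=[4, 16, 64], rng=rng)
accuracy = (forward(model, x_test)[1].argmax(axis=1) == y_test).mean()
print(f"held-out accuracy: {accuracy:.3f}")
```

The `hidden_sizes` list plays the role of the growing model sizes in this toy: each round's student is wider than its teacher, which loosely mirrors the knowledge-expansion idea of scaling the model up across iterations.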
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
