Learning Music Representations with wav2vec 2.0
Alessandro Ragano, Emmanouil Benetos, Andrew Hines

TL;DR
Pre-training wav2vec 2.0 on music data enhances its ability to encode musical concepts and improves performance on music classification tasks compared to speech-based models.
Contribution
This paper demonstrates that pre-training wav2vec 2.0 directly on music data yields better music representations than adapting speech models.
Findings
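When wav2vec 2.0 is pre-trained on music data, its discrete latent representations encode musical concepts such as pitch and instrument.
Fine-tuning the music pre-trained wav2vec 2.0 yields music classification results that are competitive with prior work on audio representations.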
Music pre-trained wav2vec 2.0 outperforms speech pre-trained models on music tasks.
Abstract
Learning music representations that are general-purpose offers the flexibility to finetune several downstream tasks using smaller datasets. The wav2vec 2.0 speech representation model showed promising results in many downstream speech tasks, but has been less effective when adapted to music. In this paper, we evaluate whether pre-training wav2vec 2.0 directly on music data can be a better solution instead of finetuning the speech model. We illustrate that when pre-training on music data, the discrete latent representations are able to encode the semantic meaning of musical concepts such as pitch and instrument. Our results show that finetuning wav2vec 2.0 pre-trained on music data allows us to achieve promising results on music classification tasks that are competitive with prior work on audio representations. In addition, the results are superior to the pre-trained model on speech…
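One way to make the claim about discrete latent representations concrete is to check how strongly the quantizer's codeword indices co-vary with a musical label such as pitch. The sketch below is illustrative only and is not the analysis used in the paper: it assumes you already have one codeword index and one pitch label per frame, and it scores their association with normalized mutual information. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((k / n) * log(k / n) for k in Counter(values).values())


def mutual_information(codes, labels):
    """Mutual information (in nats) between two aligned symbol sequences."""
    n = len(codes)
    joint = Counter(zip(codes, labels))
    code_counts = Counter(codes)
    label_counts = Counter(labels)
    mi = 0.0
    for (c, l), k in joint.items():
        # p(c, l) * log( p(c, l) / (p(c) * p(l)) ), written with raw counts
        mi += (k / n) * log(k * n / (code_counts[c] * label_counts[l]))
    return mi


def normalized_mi(codes, labels):
    """MI divided by the geometric mean of the two entropies (range 0..1)."""
    h_codes, h_labels = entropy(codes), entropy(labels)
    if h_codes == 0.0 or h_labels == 0.0:
        return 0.0  # a constant sequence carries no information
    return mutual_information(codes, labels) / sqrt(h_codes * h_labels)


# Toy data (assumed, not from the paper): each codeword maps to one pitch.
aligned_codes = [0, 0, 1, 1, 2, 2]
aligned_pitch = ["C4", "C4", "E4", "E4", "G4", "G4"]

# Toy data where the codeword says nothing about the pitch.
unrelated_codes = [0, 0, 1, 1]
unrelated_pitch = ["C4", "E4", "C4", "E4"]

score_aligned = normalized_mi(aligned_codes, aligned_pitch)
score_unrelated = normalized_mi(unrelated_codes, unrelated_pitch)
```

A score near 1 means the codeword index almost determines the pitch label, while a score near 0 means the two are statistically independent. The same probe could be run with instrument labels in place of pitch.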
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
