On the Effectiveness of Speech Self-supervised Learning for Music

Yinghao Ma; Ruibin Yuan; Yizhi Li; Ge Zhang; Xingran Chen; Hanzhi Yin; Chenghua Lin; Emmanouil Benetos; Anton Ragni; Norbert Gyenge; Ruibo Liu; Gus Xia; Roger Dannenberg; Yike Guo; Jie Fu

arXiv:2307.05161·cs.SD·July 12, 2023·2 cites

TL;DR

This paper investigates the adaptation of speech self-supervised learning models to music information retrieval, demonstrating that music data training improves MIR performance but also highlighting limitations in modeling polyphonic music.

Contribution

It introduces music2vec and musicHuBERT, adapting speech SSL models for music, and systematically evaluates their effectiveness across multiple MIR tasks.

Findings

01 Training with music data enhances MIR task performance.

02 Speech SSL models have limitations in modeling polyphonic music.

03 Empirical guidelines for future musical SSL design are proposed.

Abstract

Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) remains largely unexplored. While previous SSL models pre-trained on music recordings may have been mostly closed-source, recent speech models such as wav2vec2.0 have shown promise in music modelling. Nevertheless, research exploring the effectiveness of applying speech SSL models to music recordings has been limited. We explore the music adaptation of SSL with two distinctive speech-related models, data2vec1.0 and HuBERT, and refer to them as music2vec and musicHuBERT, respectively. We train 12 SSL models with 95M parameters under various pre-training configurations and systematically evaluate their performance on 13 different MIR tasks. Our findings suggest that training with music data…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies