Learning music audio representations via weak language supervision
Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas

TL;DR
This paper introduces MuLaP, a multimodal pre-training method that uses weakly aligned text descriptions to learn versatile music audio representations, achieving competitive results across various music classification and regression tasks.
Contribution
The paper presents a novel weakly supervised pre-training approach that uses text descriptions to learn general-purpose music audio representations, reducing reliance on extensively annotated datasets.
Findings
On the music classification and regression tasks evaluated, MuLaP performs comparably to or better than existing methods.
The approach effectively leverages audio-caption pairs for representation learning.
Pre-trained models transfer well to multiple music-related tasks.
Abstract
Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. To address this question, we design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track. After pre-training, we transfer the audio backbone of the model to a set of music audio classification and regression tasks.…
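The transfer step mentioned in the abstract (reusing a pre-trained audio backbone on a downstream classification task) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `frozen_backbone` is a hypothetical random projection standing in for a pre-trained encoder, not MuLaP's actual audio model; the data are synthetic; and training a linear probe on frozen embeddings is one common evaluation protocol, not necessarily the one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained audio backbone: a fixed (frozen)
# random projection, followed by a ReLU and mean pooling over time.
W_BACKBONE = rng.normal(size=(16, 8))

def frozen_backbone(clips):
    """Map clips of shape (n, time, 16) to embeddings of shape (n, 8)."""
    return np.maximum(clips @ W_BACKBONE, 0.0).mean(axis=1)

def train_linear_probe(X, y, n_classes, lr=0.1, steps=500):
    """Fit a softmax classifier on fixed embeddings with gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        grad = (P - Y) / len(X)  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic toy data (assumption): two classes whose clips differ in mean.
labels = rng.integers(0, 2, size=200)
clips = rng.normal(size=(200, 10, 16)) + labels[:, None, None] * 1.0

# The backbone is never updated; only the probe's W and b are trained.
embeddings = frozen_backbone(clips)
embeddings = (embeddings - embeddings.mean(axis=0)) / (embeddings.std(axis=0) + 1e-8)

W, b = train_linear_probe(embeddings, labels, n_classes=2)
accuracy = ((embeddings @ W + b).argmax(axis=1) == labels).mean()
print(f"training accuracy of the linear probe: {accuracy:.2f}")
```

Keeping the backbone fixed means the probe's score reflects the quality of the representation itself rather than the capacity of the downstream model, which is why this setup is often used to compare pre-training strategies.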
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis
