Learning music audio representations via weak language supervision

Ilaria Manco; Emmanouil Benetos; Elio Quinton; Gyorgy Fazekas

arXiv:2112.04214 · cs.SD · February 18, 2022 · 1 citation

TL;DR

This paper introduces MuLaP, a multimodal pre-training method that uses weakly aligned text descriptions to learn versatile music audio representations, achieving competitive results across various music classification and regression tasks.

Contribution

The paper presents a novel weakly supervised pre-training approach using text descriptions for general-purpose music audio representations, reducing reliance on extensive annotations.

Findings

1. MuLaP achieves comparable or superior performance to existing methods.
2. The approach effectively leverages audio-caption pairs for representation learning.
3. Pre-trained models transfer well to multiple music-related tasks.

Abstract

Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. To address this question, we design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track. After pre-training, we transfer the audio backbone of the model to a set of music audio classification and regression tasks.…
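The transfer step described in the abstract (reusing a pre-trained audio backbone for downstream classification) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `frozen_backbone` is a hypothetical stand-in that computes two hand-crafted features and is not the MuLaP encoder, the clips are synthetic sine waves, and keeping the backbone fixed while training only a linear head is one common transfer protocol that the abstract does not specify.

```python
import math
import random


def frozen_backbone(audio):
    """Hypothetical stand-in for a pre-trained audio encoder (NOT MuLaP).

    Maps a clip (list of samples) to a fixed-size embedding made of the
    mean absolute amplitude and the zero-crossing rate.
    """
    n = len(audio)
    mean_abs = sum(abs(x) for x in audio) / n
    crossings = sum(1 for a, b in zip(audio, audio[1:]) if a * b < 0)
    return [mean_abs, crossings / (n - 1)]


def train_linear_probe(embeddings, labels, epochs=500, lr=0.5):
    """Fit a logistic-regression head with SGD on fixed embeddings.

    Only the head's weights are updated; the backbone is never touched.
    """
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Return the predicted class (0 or 1) for one embedding."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def make_clip(freq, rng, n=200):
    """Synthetic toy clip: a noisy sine wave with `freq` cycles."""
    return [math.sin(2 * math.pi * freq * t / n) + rng.gauss(0, 0.05)
            for t in range(n)]


# Toy downstream task (assumed): tell low-pitched from high-pitched clips.
rng = random.Random(0)
dataset = [(make_clip(2, rng), 0) for _ in range(20)]
dataset += [(make_clip(30, rng), 1) for _ in range(20)]
rng.shuffle(dataset)

# Step 1: extract embeddings once with the fixed backbone.
embeddings = [frozen_backbone(clip) for clip, _ in dataset]
labels = [label for _, label in dataset]

# Step 2: train only the lightweight head on top of those embeddings.
w, b = train_linear_probe(embeddings, labels)
accuracy = sum(predict(w, b, x) == y
               for x, y in zip(embeddings, labels)) / len(labels)
print(f"training accuracy of the linear probe: {accuracy:.2f}")
```

For a regression task, the same pattern would apply with a squared-error head in place of the logistic one. Fine-tuning the backbone together with the head is an alternative protocol that this sketch does not cover.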

Code & Models

Repositories

ilaria-manco/mulap (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Speech Recognition and Synthesis