Codified audio language modeling learns useful representations for music information retrieval

Rodrigo Castellon; Chris Donahue; Percy Liang

arXiv:2107.05677·cs.SD·July 14, 2021·22 cites

TL;DR

Language models pre-trained on codified music audio, such as Jukebox, produce representations that, when used as input features for shallow models, outperform representations from conventional tagging-pretrained models on several music information retrieval tasks.

Contribution

This work demonstrates that codified audio language modeling with Jukebox yields richer representations for MIR tasks, outperforming conventional pre-trained tagging models.

Findings

01

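Used as input features, Jukebox representations yield 30% stronger performance on average across four MIR tasks (tagging, genre classification, emotion recognition, and key detection) than representations from models pre-trained on tagging.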

02

Representations are especially strong for key detection.

03

Modeling audio directly captures richer information than tags.

Abstract

We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al. 2020): a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection. For key detection, we observe that representations from Jukebox are considerably stronger than those from models…
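The probing setup the abstract describes (frozen representations used as input features for a shallow model) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the features are synthetic random arrays standing in for pre-extracted representations, mean pooling over time is one plausible way to get a clip-level vector, and the shallow model is a softmax linear classifier trained with plain gradient descent. The names mean_pool, train_linear_probe, and predict are hypothetical helpers.

```python
import numpy as np


def mean_pool(frames):
    """Average a (time, dim) array of frame-level features into one (dim,) vector."""
    return frames.mean(axis=0)


def train_linear_probe(X, y, n_classes, lr=0.1, epochs=500):
    """Fit a softmax linear classifier on fixed features with full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets, shape (n, n_classes)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the highest-scoring class index for each row of X."""
    return (X @ W + b).argmax(axis=1)


# Synthetic stand-in for pre-extracted representations (NOT real Jukebox features):
# each clip is a (n_frames, dim) array whose frames scatter around a class-specific mean.
rng = np.random.default_rng(0)
n_clips, n_frames, dim, n_classes = 200, 20, 16, 4
labels = rng.integers(0, n_classes, size=n_clips)
class_means = rng.normal(0.0, 2.0, size=(n_classes, dim))
clips = [class_means[c] + rng.normal(0.0, 1.0, size=(n_frames, dim)) for c in labels]

# One fixed-length vector per clip; the feature extractor itself is never updated.
X = np.stack([mean_pool(clip) for clip in clips])

# Standardize using training-split statistics only, then train and evaluate the probe.
n_train = 150
mu = X[:n_train].mean(axis=0)
sigma = X[:n_train].std(axis=0) + 1e-8
X = (X - mu) / sigma
W, b = train_linear_probe(X[:n_train], labels[:n_train], n_classes)
test_acc = (predict(X[n_train:], W, b) == labels[n_train:]).mean()
print(f"held-out probe accuracy: {test_acc:.3f}")
```

Because only the small classifier is trained, the held-out score mostly reflects how much task-relevant information the fixed features already contain, which is the sense in which a probe compares representations.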

Code & Models

Repositories

p-lambda/jukemir (official)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis

Methods: Dense Connections · Position-Wise Feed-Forward Layer · Dilated Convolution · Layer Normalization · Convolution · Residual Connection · VQ-VAE · Jukebox