DISCO-10M: A Large-Scale Music Dataset

Luca A. Lanzendörfer; Florian Grötschla; Emil Funke; Roger Wattenhofer

arXiv:2306.13512 · cs.SD · October 6, 2023 · 1 citation

TL;DR

DISCO-10M is a large-scale music dataset, filtered for quality and released with precomputed CLAP embeddings, intended to accelerate machine learning research in music by addressing the size and accessibility limits of earlier datasets.

Contribution

We introduce DISCO-10M, the largest music dataset to date, with a multi-stage filtering process and precomputed CLAP embeddings for diverse downstream applications.

Findings

1. Surpasses previous music datasets by an order of magnitude in size.

2. Includes precomputed CLAP embeddings for immediate use.

3. Facilitates new machine learning research in music.

Abstract

Music datasets play a crucial role in advancing research in machine learning for music. However, existing music datasets suffer from limited size, accessibility, and lack of audio resources. To address these shortcomings, we present DISCO-10M, a novel and extensive music dataset that surpasses the largest previously available music dataset by an order of magnitude. To ensure high-quality data, we implement a multi-stage filtering process. This process incorporates similarities based on textual descriptions and audio embeddings. Moreover, we provide precomputed CLAP embeddings alongside DISCO-10M, facilitating direct application on various downstream tasks. These embeddings enable efficient exploration of machine learning applications on the provided data. With DISCO-10M, we aim to democratize and facilitate new research to help advance the development of novel machine learning models…
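The abstract mentions filtering based on similarities between textual descriptions and audio embeddings, but gives no details here. As a rough illustration only, one common way to filter by embedding similarity is to keep the items whose cosine similarity to a reference embedding meets a threshold. The sketch below assumes that approach; the function names, the toy vectors, and the threshold value are all hypothetical and are not taken from the paper or from any CLAP library.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    audio_embeddings: Dict[str, Sequence[float]],
    reference_embedding: Sequence[float],
    threshold: float,
) -> List[str]:
    """Return the IDs whose embedding similarity to the reference is >= threshold."""
    return [
        track_id
        for track_id, emb in audio_embeddings.items()
        if cosine_similarity(emb, reference_embedding) >= threshold
    ]


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
toy_embeddings = {
    "track_a": [1.0, 0.0, 0.0],
    "track_b": [0.0, 1.0, 0.0],
    "track_c": [0.9, 0.1, 0.0],
}
# Hypothetical embedding of a text prompt describing music.
music_prompt_embedding = [1.0, 0.0, 0.0]

# The threshold of 0.5 is an arbitrary illustrative value, not one from the paper.
kept = filter_by_similarity(toy_embeddings, music_prompt_embedding, threshold=0.5)
print(kept)
```

In practice the vectors would come from an audio-text embedding model and the threshold would need to be tuned on labeled examples; a multi-stage pipeline could apply several such checks in sequence.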

Code & Models

Datasets

laion/LAION-DISCO-12M
dataset · 159 dl

Videos

DISCO-10M: A Large-Scale Music Dataset · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies