DISCO-10M: A Large-Scale Music Dataset
Luca A. Lanzendörfer, Florian Grötschla, Emil Funke, Roger Wattenhofer

TL;DR
DISCO-10M is a large-scale, high-quality music dataset with precomputed embeddings designed to accelerate machine learning research in music by overcoming previous data limitations.
Contribution
We introduce DISCO-10M, the largest music dataset to date, with a multi-stage filtering process and precomputed CLAP embeddings for diverse downstream applications.
Findings
Surpasses the largest previously available music dataset by an order of magnitude in size.
Includes precomputed CLAP embeddings for immediate use.
Facilitates new machine learning research in music.
Abstract
Music datasets play a crucial role in advancing research in machine learning for music. However, existing music datasets suffer from limited size, accessibility, and lack of audio resources. To address these shortcomings, we present DISCO-10M, a novel and extensive music dataset that surpasses the largest previously available music dataset by an order of magnitude. To ensure high-quality data, we implement a multi-stage filtering process. This process incorporates similarities based on textual descriptions and audio embeddings. Moreover, we provide precomputed CLAP embeddings alongside DISCO-10M, facilitating direct application on various downstream tasks. These embeddings enable efficient exploration of machine learning applications on the provided data. With DISCO-10M, we aim to democratize and facilitate new research to help advance the development of novel machine learning models…
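The abstract mentions filtering based on similarities between textual descriptions and audio embeddings, but this summary gives no stages, thresholds, or reference prompts. The following is therefore only a minimal sketch of one similarity-based filtering step, not the authors' pipeline. It assumes each track has an embedding vector and that a reference embedding (for example, of a text prompt describing music) lives in the same space, as with CLAP. The function names, the toy 3-dimensional vectors, and the threshold of 0.5 are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(track_embeddings, reference_embedding, threshold):
    """Return the ids of tracks whose embedding is at least `threshold`
    similar to the reference embedding (hypothetical filtering rule)."""
    return [
        track_id
        for track_id, embedding in track_embeddings.items()
        if cosine_similarity(embedding, reference_embedding) >= threshold
    ]


# Toy stand-ins for precomputed audio embeddings (real CLAP vectors are much larger).
tracks = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.0, 1.0, 0.0],
    "track_c": [0.7, 0.7, 0.1],
}
# Toy stand-in for the embedding of a text prompt describing music.
reference = [1.0, 0.0, 0.0]

kept = filter_by_similarity(tracks, reference, threshold=0.5)
print(kept)
```

With precomputed embeddings, a step like this only needs vector arithmetic, so it can be run without re-encoding any audio.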
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
