Discogs-VI: A Musical Version Identification Dataset Based on Public Editorial Metadata
R. Oguz Araz, Xavier Serra, Dmitry Bogdanov

TL;DR
This paper introduces Discogs-VI, a large and diverse musical version identification dataset derived from the Discogs database, enabling more robust neural network training and realistic evaluations in music version identification tasks.
Contribution
The paper presents a novel, large-scale dataset for music version identification based on editorial metadata, significantly expanding existing resources and providing tools and baseline models for the community.
Findings
The dataset contains approximately 1.9 million versions across 348,000 cliques.
Mapping to YouTube uploads results in about 493,000 versions across 98,000 cliques.
A baseline neural network trained on this dataset achieves competitive results on standard benchmarks.
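As a quick sanity check on the figures above, the following sketch derives the average clique size and the share of the full dataset that survives the YouTube mapping. It uses only the rounded counts reported here, so the results are approximate. The names `FULL`, `YOUTUBE`, `mean_clique_size`, and `retained_fraction` are illustrative and do not come from the paper or its tooling.

```python
# Approximate counts as reported in the findings above (rounded figures).
FULL = {"versions": 1_900_000, "cliques": 348_000}
YOUTUBE = {"versions": 493_000, "cliques": 98_000}


def mean_clique_size(stats):
    """Average number of versions per clique."""
    return stats["versions"] / stats["cliques"]


def retained_fraction(subset, full, key):
    """Share of the full dataset's `key` count that remains in the subset."""
    return subset[key] / full[key]


full_mean = mean_clique_size(FULL)
youtube_mean = mean_clique_size(YOUTUBE)
versions_kept = retained_fraction(YOUTUBE, FULL, "versions")
cliques_kept = retained_fraction(YOUTUBE, FULL, "cliques")

print(f"Mean clique size (full):    {full_mean:.2f}")
print(f"Mean clique size (YouTube): {youtube_mean:.2f}")
print(f"Versions retained:          {versions_kept:.1%}")
print(f"Cliques retained:           {cliques_kept:.1%}")
```

With these rounded inputs, both sets average roughly five versions per clique, and the YouTube-mapped subset keeps a little over a quarter of the versions and cliques. A mean says nothing about the shape of the clique size distribution, which is the property the abstract highlights for realistic evaluation.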
Abstract
Current version identification (VI) datasets often lack sufficient size and musical diversity to train robust neural networks (NNs). Additionally, their non-representative clique size distributions prevent realistic system evaluations. To address these challenges, we explore the untapped potential of the rich editorial metadata in the Discogs music database and create a large dataset of musical versions containing about 1,900,000 versions across 348,000 cliques. Utilizing a high-precision search algorithm, we map this dataset to official music uploads on YouTube, resulting in a dataset of approximately 493,000 versions across 98,000 cliques. This dataset offers over nine times as many cliques and over four times as many versions as existing datasets. We demonstrate the utility of our dataset by training a baseline NN without extensive model complexities or data…
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Music History and Culture
