Towards Realistic Synthetic Data for Automatic Drum Transcription
Pierfrancesco Melucci, Paolo Merialdo, Taketo Akama

TL;DR
This paper presents a semi-supervised approach to create a large, diverse synthetic drum dataset from unlabeled audio, enabling training of a high-performance ADT model that surpasses previous methods.
Contribution
The authors introduce a novel semi-supervised method to automatically curate one-shot drum samples and synthesize training data, reducing reliance on paired datasets and narrowing the domain gap.
Findings
Achieved state-of-the-art results on ENST and MDB datasets.
Outperformed fully supervised and previous synthetic-data methods.
Demonstrated effectiveness of synthetic data for ADT training.
Abstract
Deep learning models define the state-of-the-art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large-scale, paired audio-MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as they typically rely on low-fidelity SoundFont libraries that lack acoustic diversity. While high-quality one-shot samples offer a better alternative, they are not available in a standardized, large-scale format suitable for training. This paper introduces a new paradigm for ADT that circumvents the need for paired audio-MIDI training data. Our primary contribution is a semi-supervised method to automatically curate a large and diverse corpus of one-shot drum samples from unlabeled audio sources. We then use this corpus to synthesize a high-quality dataset from MIDI files alone, which we use to train a…
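The abstract does not detail the synthesis step, so the following is only a minimal sketch of the general idea of rendering audio from note events with one-shot samples, not the authors' pipeline. The function name `render_drum_track`, the event tuple layout, the 16 kHz sample rate, and the linear velocity scaling are all assumptions made for illustration.

```python
import numpy as np

SR = 16_000  # assumed sample rate; the abstract does not state one


def render_drum_track(events, samples, duration_s, sr=SR, rng=None):
    """Mix one-shot samples into a mono track at the given onset times.

    events:  list of (onset_seconds, drum_class, velocity), velocity in 1..127
    samples: dict mapping drum_class -> list of 1-D float arrays (one-shots)
    Both structures are hypothetical stand-ins for parsed MIDI and a
    curated sample corpus.
    """
    rng = rng or np.random.default_rng()
    out = np.zeros(int(round(duration_s * sr)), dtype=np.float32)
    for onset, drum, velocity in events:
        pool = samples[drum]
        # Pick a random one-shot per hit so renders vary acoustically.
        shot = pool[rng.integers(len(pool))]
        start = int(round(onset * sr))
        if start >= len(out):
            continue  # onset falls outside the rendered window
        end = min(start + len(shot), len(out))
        # Assumed linear velocity-to-gain mapping.
        out[start:end] += (velocity / 127.0) * shot[: end - start]
    return out
```

Because the event list that drives the rendering is also the ground-truth annotation, each rendered clip comes with exact onset labels, which is what makes MIDI-only synthesis usable as transcription training data.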
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
