Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
Konstantinos Soiledis, Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Tsamis

TL;DR
This paper introduces a neural network system that converts expressive drum MIDI grids into realistic drum audio by predicting neural audio codec tokens, evaluated on the E-GMD dataset.
Contribution
It demonstrates that predicting neural audio codec tokens from drum grids is effective for realistic drum sound synthesis, comparing multiple state-of-the-art codecs.
Findings
Codec-token prediction yields high-fidelity drum audio
Transformer models effectively map MIDI to audio tokens
Choice of neural codec impacts audio quality
Abstract
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the…
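To make the input side concrete, here is a minimal sketch of how an expressive drum grid could be built from timed drum hits. The summary above does not specify the paper's exact grid format, so everything here is an assumption for illustration: the `DrumNote` type, the `build_drum_grid` function, the 16th-note resolution, and the three-matrix layout (hit, velocity, microtiming offset), which follows a common convention for groove data.

```python
import math
from dataclasses import dataclass


@dataclass
class DrumNote:
    """Hypothetical container for one drum hit taken from a MIDI track."""
    onset_sec: float  # onset time in seconds
    drum: int         # drum class index, 0 <= drum < n_drums
    velocity: int     # MIDI velocity, 0..127


def build_drum_grid(notes, bpm, n_steps, n_drums, steps_per_beat=4):
    """Quantize notes onto a fixed grid while keeping expressive detail.

    Returns three n_steps x n_drums matrices (lists of lists):
      hits    -- 1 where a drum is struck on that step, else 0
      vels    -- velocity scaled to [0, 1]
      offsets -- microtiming as a fraction of one step, in [-0.5, 0.5)
    """
    step_sec = 60.0 / bpm / steps_per_beat  # duration of one grid step
    hits = [[0] * n_drums for _ in range(n_steps)]
    vels = [[0.0] * n_drums for _ in range(n_steps)]
    offsets = [[0.0] * n_drums for _ in range(n_steps)]

    for note in notes:
        pos = note.onset_sec / step_sec      # position in fractional steps
        step = int(math.floor(pos + 0.5))    # nearest grid step
        if not 0 <= step < n_steps:
            continue                         # outside the grid window
        vel = note.velocity / 127.0
        # If several hits land in one cell, keep the loudest one.
        # Velocity-0 notes never pass this check, so they are ignored.
        if vel > vels[step][note.drum]:
            hits[step][note.drum] = 1
            vels[step][note.drum] = vel
            offsets[step][note.drum] = pos - step

    return hits, vels, offsets


# Example: two hits at 120 BPM, where one 16th-note step lasts 0.125 s.
example_notes = [
    DrumNote(onset_sec=0.0, drum=0, velocity=127),   # exactly on step 0
    DrumNote(onset_sec=0.135, drum=1, velocity=64),  # slightly late on step 1
]
hits, vels, offsets = build_drum_grid(example_notes, bpm=120, n_steps=16, n_drums=9)
```

In a pipeline like the one the abstract describes, a grid of this kind would be the model input, and the training targets would be the discrete codec tokens obtained by encoding the paired audio.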
