Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering
Konstantinos Soiledis, Maximos Kaliakatsos Papakostas, Dimos Makris, Konstantinos Tsamis

TL;DR
This paper introduces Sec2Drum-DAC, a conditional latent diffusion model for symbolic-to-audio drum rendering that preserves event timing and dynamics and improves paired spectral and transient metrics over deterministic PCA regression.
Contribution
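The paper proposes a latent diffusion approach conditioned on event features sampled in physical time at codec-frame locations. Instead of waveform samples, the model predicts standardized principal-component coordinates of frozen DAC summed-codebook embeddings, which gives a compact continuous denoising target.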
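To illustrate what conditioning in physical time might look like, the sketch below places drum events, given as onset times in seconds, onto a grid of codec frames. It is a minimal illustration, not the authors' feature pipeline. The function name `events_to_frames`, the (onset, class, velocity) event format, the one-cell-per-event encoding, and the frame rate of 44100 / 512 Hz are all assumptions made for the example.

```python
import numpy as np

def events_to_frames(events, n_frames, frame_rate_hz, n_classes):
    """Place drum events on a codec-frame grid (hypothetical helper).

    events: iterable of (onset_seconds, drum_class, velocity in 0..1).
    Returns an (n_frames, n_classes) array holding the velocity of the
    loudest event of each class that starts in each frame.
    """
    cond = np.zeros((n_frames, n_classes), dtype=np.float32)
    for onset_s, drum_class, velocity in events:
        # Seconds map straight to a frame index; tempo plays no role.
        frame = int(np.floor(onset_s * frame_rate_hz))
        if 0 <= frame < n_frames:
            cond[frame, drum_class] = max(cond[frame, drum_class], velocity)
    return cond

# Assumed codec frame rate: 44.1 kHz audio with a hop of 512 samples.
FRAME_RATE_HZ = 44100 / 512

# Toy pattern: (onset in seconds, drum class, velocity).
events = [(0.0, 0, 0.9), (0.5, 1, 0.6), (0.5, 2, 0.4), (1.0, 0, 1.0)]
cond = events_to_frames(events, n_frames=173,
                        frame_rate_hz=FRAME_RATE_HZ, n_classes=3)
```

Because onsets are indexed in seconds rather than beats, the same event list lands on the same frames regardless of tempo, which is the property the seconds-aligned design appears to target.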
Findings
PCA diffusion outperforms deterministic PCA regression on spectral and transient metrics.
An auxiliary RVQ cross-entropy term improves diffusion performance at short step counts.
Optimal denoising steps range from 6 to 25 depending on the metric.
Abstract
Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features sampled in physical time at codec-frame locations and predicts standardized principal-component coordinates of frozen DAC summed-codebook embeddings rather than waveform samples. In the evaluated DAC configuration, 72 principal components capture the observed training-frame summed-latent subspace under the stated SVD threshold, yielding a compact continuous denoising target with a deterministic reconstruction path to the 1024-dimensional DAC latent space before waveform decoding. Across 1,733 held-out four-beat windows, PCA diffusion improves paired spectral and transient metrics over deterministic PCA…
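The PCA target and its deterministic reconstruction path can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation. The relative singular-value threshold `rel_threshold`, the helper names `fit_pca`, `encode`, and `decode`, and the synthetic data standing in for DAC summed-codebook embeddings are all hypothetical. The toy data is built to lie in a 72-dimensional subspace of a 1024-dimensional space only to mirror the dimensions quoted above.

```python
import numpy as np

def fit_pca(latents, rel_threshold=1e-6):
    """Fit PCA by SVD on centered frames.

    Keeps components whose singular value exceeds rel_threshold times
    the largest one (an assumed rule; the paper's threshold is not
    given here).
    """
    mean = latents.mean(axis=0)
    centered = latents - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = int(np.sum(s > rel_threshold * s[0]))
    components = vt[:k]                    # (k, D) orthonormal rows
    coords = centered @ components.T       # (N, k) raw coordinates
    scale = coords.std(axis=0) + 1e-8      # per-component standardization
    return {"mean": mean, "components": components, "scale": scale}

def encode(pca, latents):
    """Map D-dim latents to standardized PCA coordinates."""
    return ((latents - pca["mean"]) @ pca["components"].T) / pca["scale"]

def decode(pca, z):
    """Deterministically map standardized coordinates back to D dims."""
    return (z * pca["scale"]) @ pca["components"] + pca["mean"]

rng = np.random.default_rng(0)
# Synthetic stand-in: 2000 frames in 1024-D lying in a 72-D subspace.
basis = rng.standard_normal((72, 1024))
latents = rng.standard_normal((2000, 72)) @ basis + rng.standard_normal(1024)

pca = fit_pca(latents)
z = encode(pca, latents)   # what a denoiser would be trained to predict
recon = decode(pca, z)     # input to the frozen codec decoder
```

In this sketch a diffusion model would operate on `z`, and `decode` would return a predicted `z` to the 1024-dimensional latent space before waveform decoding. Standardizing each coordinate keeps all components at unit scale, which is a common choice when the target of a Gaussian diffusion process is continuous.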
