D3RM: A Discrete Denoising Diffusion Refinement Model for Piano Transcription
Hounsu Kim, Taegyun Kwon, Juhan Nam

TL;DR
This paper introduces D3RM, a novel discrete diffusion model with Neighborhood Attention for improved piano transcription, outperforming previous diffusion-based methods on the MAESTRO dataset.
Contribution
The paper proposes a new architecture for piano transcription using discrete diffusion models with Neighborhood Attention and a novel training-inference transition strategy.
Findings
Outperforms previous diffusion-based models in F1 score on MAESTRO
Utilizes Neighborhood Attention layers for denoising
Employs a novel transition strategy during training and inference
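To make the Neighborhood Attention finding concrete, here is a minimal NumPy sketch of the general idea: each query attends only to a fixed-size window of nearby positions instead of the whole sequence. This is a simplified 1D, single-head illustration, not the paper's denoising module. The function name `neighborhood_attention_1d`, the `window` parameter, and the border handling (shifting the window so every position still sees `window` neighbors) are assumptions made for this example.

```python
import numpy as np

def neighborhood_attention_1d(q, k, v, window):
    """Single-head attention where position i only sees `window` nearby keys.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Illustrative only: the summary above does not specify the layer's layout.
    """
    n, d = q.shape
    assert window % 2 == 1 and window <= n, "window must be odd and fit the sequence"
    half = window // 2
    out = np.empty((n, v.shape[1]), dtype=float)
    for i in range(n):
        # Clamp the window at the borders so it always holds `window` keys.
        start = min(max(i - half, 0), n - window)
        ks = k[start:start + window]
        vs = v[start:start + window]
        scores = ks @ q[i] / np.sqrt(d)
        # Numerically stable softmax over the local neighborhood only.
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ vs
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
local_out = neighborhood_attention_1d(q, k, v, window=5)
```

Restricting attention to a local window keeps the cost linear in sequence length for a fixed `window`, which is one reason this kind of layer is attractive for high-resolution inputs such as piano rolls.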
Abstract
Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy that applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on…
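For readers unfamiliar with discrete diffusion, the sketch below shows one common form of the forward (corruption) step, in which each cell of a piano roll is independently replaced by an absorbing mask state. It is a generic illustration under stated assumptions and does not reproduce the paper's training/inference transition strategy, which the excerpt above does not detail. The names `NUM_STATES`, `MASK`, `corrupt`, and `keep_prob` are hypothetical, as is the choice of five note states.

```python
import numpy as np

NUM_STATES = 5       # hypothetical number of note states per piano-roll cell
MASK = NUM_STATES    # extra index used as the absorbing "masked" state

def corrupt(roll, keep_prob, rng):
    """Replace each cell with MASK independently with probability 1 - keep_prob.

    roll: integer array of shape (frames, pitches) with values in [0, NUM_STATES).
    keep_prob: fraction of cells expected to survive at this diffusion step.
    """
    keep = rng.random(roll.shape) < keep_prob
    return np.where(keep, roll, MASK)

rng = np.random.default_rng(0)
roll = rng.integers(0, NUM_STATES, size=(100, 88))   # toy piano roll, 88 pitches
noisy = corrupt(roll, keep_prob=0.3, rng=rng)
```

A denoising network would then be trained to predict the original states at the masked cells, typically conditioned on acoustic features; lowering `keep_prob` corresponds to later, noisier diffusion steps.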
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
Methods: Softmax · Attention Is All You Need · Diffusion · Neighborhood Attention · Focus
