Global Structure-Aware Drum Transcription Based on Self-Attention Mechanisms
Ryoto Ishizuka, Ryo Nishikimi, Kazuyoshi Yoshii

TL;DR
This paper introduces a global structure-aware drum transcription method that uses self-attention to estimate tatum-level drum scores directly from music signals. It outperforms conventional RNN-based models, especially when paired training data is limited.
Contribution
The paper proposes a deep transcription model with self-attention, together with a regularized training approach that uses a pretrained score language model to improve drum transcription.
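To illustrate why self-attention suits repetitive material, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is not the paper's model: the identity query/key/value projections, the single head, and the function names `softmax` and `self_attention` are assumptions made for brevity, whereas a trained model would use learned projections. The point is only that every position attends to every other position in one step, so two distant positions with the same pattern can be linked directly.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are all the input itself
    (identity projections), which keeps the sketch short.

    x: list of N feature vectors, each a list of D floats.
    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in x:
        # Similarity of this position to every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append(
            [sum(wj * v[i] for wj, v in zip(w, x)) for i in range(d)]
        )
    return outputs, weights


# Toy sequence with a pattern that repeats every two positions.
pattern = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
outputs, weights = self_attention(pattern)
```

In this toy sequence, position 0 gives positions 0 and 2 equal weight and gives much less to positions 1 and 3, because the similarity depends on content and not on distance.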
Findings
Outperforms RNN-based models in tatum-level error rate
Effective with limited paired training data
Enhances musical naturalness of estimated scores
Abstract
This paper describes an automatic drum transcription (ADT) method that directly estimates a tatum-level drum score from a music signal, in contrast to most conventional ADT methods that estimate the frame-level onset probabilities of drums. To estimate a tatum-level score, we propose a deep transcription model that consists of a frame-level encoder for extracting the latent features from a music signal and a tatum-level decoder for estimating a drum score from the latent features pooled at the tatum level. To capture the global repetitive structure of drum scores, which is difficult to learn with a recurrent neural network (RNN), we introduce a self-attention mechanism with tatum-synchronous positional encoding into the decoder. To mitigate the difficulty of training the self-attention-based model from an insufficient amount of paired data and improve the musical naturalness of the…
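The step that connects the frame-level encoder to the tatum-level decoder is the pooling of latent features at the tatum level. The sketch below shows one way this could look in plain Python. The use of max-pooling, the function name `pool_frames_to_tatums`, and the representation of tatum boundaries as frame indices are all assumptions for illustration; the abstract only states that features are pooled at the tatum level.

```python
def pool_frames_to_tatums(frames, tatum_starts):
    """Pool frame-level feature vectors into one vector per tatum.

    Assumption: element-wise max-pooling within each tatum interval.

    frames: list of T feature vectors, each a list of D floats.
    tatum_starts: sorted frame indices at which each tatum begins;
        the last tatum runs to the end of the sequence.
    Returns a list with one D-dimensional vector per tatum.
    """
    if not tatum_starts:
        return []
    bounds = list(tatum_starts) + [len(frames)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = frames[start:end]
        if not segment:
            raise ValueError(f"tatum starting at frame {start} has no frames")
        # zip(*segment) iterates over feature dimensions across the frames.
        pooled.append([max(column) for column in zip(*segment)])
    return pooled


# Five frames of 2-dimensional features, grouped into three tatums.
frames = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5], [0.8, 0.1], [0.2, 0.7]]
tatum_features = pool_frames_to_tatums(frames, [0, 2, 4])
# tatum_features == [[0.4, 0.9], [0.8, 0.5], [0.2, 0.7]]
```

After pooling, the sequence length equals the number of tatums, not the number of frames, so a decoder operating on it works on the same time grid as the drum score it estimates.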
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
