Deep Unsupervised Drum Transcription
Keunwoo Choi, Kyunghyun Cho

TL;DR
This paper presents DrummerNet, an unsupervised deep learning system for drum transcription that learns from unlabeled data by minimizing reconstruction error, outperforming many existing methods.
Contribution
Introduces DrummerNet, a novel unsupervised neural network approach for drum transcription that does not require ground-truth labels and leverages large unlabeled datasets.
Findings
Performs favorably compared to recent supervised and unsupervised systems
Successfully learns drum transcription without ground-truth annotations
Demonstrates scalability with large unlabeled datasets
Abstract
We introduce DrummerNet, a drum transcription system that is trained in an unsupervised manner. DrummerNet does not require any ground-truth transcription and, with the data-scalability of deep neural networks, learns from a large unlabeled dataset. In DrummerNet, the target drum signal is first passed to a (trainable) transcriber, then reconstructed in a (fixed) synthesizer according to the transcription estimate. By training the system to minimize the distance between the input and the output audio signals, the transcriber learns to transcribe without ground truth transcription. Our experiment shows that DrummerNet performs favorably compared to many other recent drum transcription systems, both supervised and unsupervised.
| Name | Description | Note |
|---|---|---|
| | The temporal index/length of audio input | |
| | The index/total number of drum components | |
| | Mixture and transcription | |
| | Estimations of mixture/transcription | |

| Class | Subclass | Description |
|---|---|---|
| KD | KD | Kick drum |
| SD | SD | Snare drum |
| HH | CHH, PHH | Closed/pedalled hi-hat |
| | OHH | Open hi-hat |
| TT | HIT, MHT, HFT, LFT* | High/high-mid/high-floor/low-floor tom |
| CY | RDC, RDB* | Ride cymbal, ride cymbal bell |
| OT | SST*, TMB | Side stick, tambourine |

| Module | Input (size) | Output (size) |
|---|---|---|
| U-net encoder | | |
| Conv1D | | |
| Conv1D | | |
| U-net decoder | | |
| Conv1D | | |
| Recurrent layers | | |
| Sparsemax | | |
| Upsampler | | |
| Synthesis module | | |
| Channel splitter | | |
| Each Conv1D | | |
| Sum (mixer) | | |
1 Introduction
Transcription is a music information retrieval task whose goal is to estimate the score from input audio. Most recent transcription systems are based on supervised learning, where the transcriber is an analysis system trained on annotated pairs of audio and transcription to minimize the distance between the ground-truth transcription and its estimate [27, 6, 31, 37, 38, 33, 34, 7].
The trend is similar in drum transcription on which we focus in this paper. Supervised learning approaches may incorporate models based on frame-based feature extraction and classification [15], non-negative matrix factorization (NMF) for pattern matching [10], or hidden-Markov model [25]. More attention has been given recently to deep learning based models such as convolutional neural networks (CNNs, [13, 34]) and recurrent neural networks (RNNs, [37, 38, 33]), all of which have greatly improved transcription systems.
However, the lack of a large-scale annotated dataset is one of the most frequently mentioned obstacles that hinder further improvement. In practice, this limits the generalizability of supervised learning systems, as will be discussed in Section 4, and using synthetic data is one way to address this issue [7, 39]. Although there have been proposals to use unlabeled data [42, 43], the issue remains as they still rely on supervised learning combined with teacher-student learning [16]. Parallel to those approaches, an annotation-free and, therefore, more scalable and generalizable alternative would be unsupervised learning.
Unsurprisingly, one of the humans’ music learning procedures, self-taught by trial-and-error, is very similar to unsupervised learning. For example, musicians learn to transcribe by (a) listening, (b) playing an instrument, (c) identifying differences, and (d) making adjustments. Can this be done without any supervision? Yes, if the person can spot the pitch difference (e.g., the pitch should be higher or lower). Consistent with this logic, developing a transcription system based on unsupervised learning would be feasible if the system can test the estimation, measure the error, and correct itself accordingly.
To implement such an unsupervised transcription system, we need a synthesis system in addition to the transcriber, so that the overall system maps the input audio to a transcription estimate and then back to reconstructed audio. During training, the system is given only the audio and is trained to minimize the distance between the input and its reconstruction. A few existing systems rely on unsupervised learning in this sense. In MIR, the system in [1] utilized sparse coding to learn a dictionary of time-frequency templates of piano and harpsichord, assuming a (matrix-)multiplication model with additive noise. Yoshii et al. proposed to use sparse coding in a jointly-learned chord recognition and transcription system [44]. Berg et al. designed a probabilistic graphical model that parameterizes the spectral and temporal envelopes, note events, and note activations, in order to transcribe piano by inferring their parameters [2]. In drum transcription, many systems have used NMF to decompose a drum track spectrum into spectral templates and their temporal activations (or transcription) [26, 41]. Several variants of NMF were proposed to address the limits of the fixed spectrum template of NMF [29, 19, 20]. Lastly, a similar system can be found in computer vision, where the parameters of input images are estimated by reconstruction using an image renderer [18].
In this paper, we introduce DrummerNet, a deep neural network based drum transcription system that is trained by unsupervised learning. With a more end-to-end approach, DrummerNet is distinguished from previous research [1, 44, 2], which has strong priors on the target sounds. In §2, we present the system design principle behind DrummerNet, followed by its details in §3. In §4, the evaluation results are discussed along with the ablation study. We present our conclusion, the problems of our system, and the future direction towards fully unsupervised learning in transcription/MIR in §5.
2 System Design Principles
Training the proposed DrummerNet is similar to the previous unsupervised learning approaches in music [1, 44, 2], as they all train a system to output a transcription estimate from which the original signal can be reconstructed. The difference between the input audio and its reconstruction works as a proxy for the difference between the true transcription and its estimate.
There are three conditions under which unsupervised learning of a transcriber can be achieved successfully. First, the output of the analysis module must be in the form of a transcription: a set of discrete events representing the timing and intensity of the notes. Second, the synthesis module must synthesize the audio signal from the transcription it is given as input. Third, all the components in the network must be differentiable, as we rely on backpropagation to train it.
3 DrummerNet
In this section, we introduce the proposed system structure. We specify the number of channels, kernel size, and stride as (channel, kernel, stride). All the convolutional and recurrent layers use an exponential linear unit as an activation function [9]. The implementation of DrummerNet is available at https://github.com/keunwoochoi/DrummerNet.
3.1 Analysis module
The analysis module, as illustrated in the top half of Figure 1, takes the audio signal as input and processes it through a U-net variant [30], recurrent layers, and a gated Sparsemax activation [21]. After training, this module is used as a transcriber (with peak-picking).
U-net
The U-net consists of 1D convolutional layers, max-pooling layers, and concatenations between the encoder and the decoder. The encoder consists of a convolutional layer (128, 3, 1) followed by 10 convolutional layers (50, 3, 1) interleaved with max-pooling of size 2. As a result, it outputs a feature map with a receptive field of 3,072 time steps.
The decoder has only 6 convolutional layers (50, 3, 1), interleaved with 2× bi-linear interpolation and concatenation with the encoder feature map at the same depth. The decoder output is the representation from which the transcription is estimated. The asymmetry between the encoder and the decoder makes this representation shorter than the input by a factor of 16. Assuming the input audio is sampled at 16 kHz (the sampling rate of the input audio in our experiment), the decoder output has a sampling rate of 1,000 Hz.
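The relation between the encoder/decoder asymmetry and the 1,000 Hz output rate can be checked with a few lines of arithmetic. The sketch below assumes that each of the 10 pooling steps halves the length and that each of the 6 decoder stages doubles it; the function name is illustrative and not taken from the DrummerNet code.

```python
def decoder_output_rate(input_rate_hz, n_pool, n_upsample, factor=2):
    """Return (output rate in Hz, overall length reduction factor).

    Assumes every pooling step divides the length by `factor` and every
    interpolation step multiplies it by `factor`.
    """
    reduction = factor ** (n_pool - n_upsample)
    return input_rate_hz / reduction, reduction


rate_hz, reduction = decoder_output_rate(16000, n_pool=10, n_upsample=6)
print(reduction, rate_hz)
```

Under these assumptions the decoder output is 16 times shorter than the input, which matches the 16 kHz to 1,000 Hz figures in the text.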
Recurrent layers
We use three recurrent layers (GRUs [8]): {along time-axis, bi-directional, 100-channel}, {along time-axis, uni-directional, 50-channel}, and {along channel-axis, uni-directional, one channel per drum component}. These three recurrent layers have the properties of i) being bi-directional, so that the onset at a given time step can be determined by its vicinity (both the past and the future), ii) enforcing temporal dependency, and iii) enforcing component-wise dependency, respectively. The width (or the hidden vector size) of the third recurrent layer is equal to the number of drum components in the synthesizer, so that each channel maps to one drum component.
Sparsemax
In an ideal case of transcription, there would be local sparsity along both the time and channel axes, because drum events would not repeat at a rate of 1,000 Hz (which is far faster than sixteenth notes at 240 BPM), nor would all the drum components be activated simultaneously. Although sparsity is one of the properties that can be achieved by the autoregressive nature of the recurrent layers, we add a Sparsemax [21] activation to encourage it explicitly. The output of Sparsemax has two important properties: i) it always sums to 1 (same as Softmax) and ii) it is highly likely to be sparse with actual zeros (unlike Softmax). In DrummerNet, two Sparsemax layers are applied in parallel, one along the channel axis (= instrument axis) and the other along the time axis within non-overlapping windows of size 64. This design choice is based on the assumption that there are only a few onsets among the drum components (channel-axis sparsity) and within 64 samples at 1,000 Hz, or 64 ms (temporal sparsity). The outputs from these two Sparsemax layers are then multiplied element-wise.
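To make the two properties of Sparsemax concrete, here is a dependency-free sketch of the sort-and-threshold projection from [21], together with one possible reading of the parallel channel-axis and windowed time-axis arrangement. The function names and the list-of-lists layout are assumptions made for illustration; the actual DrummerNet layers operate on tensors and may differ in detail.

```python
def sparsemax(z):
    """Project the score vector z onto the probability simplex.

    The output sums to 1 and typically contains exact zeros.
    """
    cumsum, tau = 0.0, 0.0
    for k, v in enumerate(sorted(z, reverse=True), start=1):
        cumsum += v
        if 1 + k * v > cumsum:
            tau = (cumsum - 1) / k  # threshold for a support of size k
    return [max(v - tau, 0.0) for v in z]


def dual_sparsemax(acts, window=64):
    """acts[c][t] is the activation of channel c at time step t.

    Multiplies a channel-axis sparsemax (per time step) with a time-axis
    sparsemax computed inside non-overlapping windows (per channel).
    """
    n_ch, n_t = len(acts), len(acts[0])
    by_channel = [sparsemax([acts[c][t] for c in range(n_ch)])
                  for t in range(n_t)]
    out = [[0.0] * n_t for _ in range(n_ch)]
    for c in range(n_ch):
        for start in range(0, n_t, window):
            segment = sparsemax(acts[c][start:start + window])
            for i, v in enumerate(segment):
                out[c][start + i] = v * by_channel[start + i][c]
    return out


probs = sparsemax([1.0, 0.8, -1.0])  # the lowest score is set to exactly 0
gated = dual_sparsemax([[3.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 2.5, 0.0]], window=4)
```

In the toy example, each channel keeps a single non-zero entry at its strongest time step, which is the kind of output a peak-picker can work with directly.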
Upsampler
Finally, the low temporal resolution of the Sparsemax output is addressed by zero-insertion upsampling by a factor of 16. This changes the temporal quantization rate of the events, unlike the upsampling of a digital signal.
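Zero-insertion upsampling is simple enough to show in full. The sketch below is an illustration on a plain Python list, not the tensor operation used in the DrummerNet implementation: each event keeps its amplitude and is placed on a grid that is `factor` times finer, with zeros in between.

```python
def zero_insert_upsample(seq, factor=16):
    """Place each value of seq at every `factor`-th position, zeros elsewhere."""
    out = [0.0] * (len(seq) * factor)
    out[::factor] = seq
    return out


upsampled = zero_insert_upsample([0.0, 0.7, 0.0], factor=4)
```

No interpolation takes place, so an onset of amplitude 0.7 at coarse index 1 becomes a single impulse of the same amplitude at fine index 4.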
3.2 Synthesis module
The synthesis module consists of parallel 1D convolutional layers, one per drum component, and a channel-wise summing operator. The kernel of each layer is not trained but fixed to the known waveform of the corresponding drum component, so that it converts the transcription of that component into an audio track. The tracks are summed to generate the final output, the synthesized audio signal. This module is only used during training.
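The idea of a fixed-kernel synthesizer can be sketched without any deep learning framework. The following minimal example assumes a causal convolution (each onset triggers the waveform starting at the onset sample) and uses made-up two- and three-sample "waveforms"; in DrummerNet the kernels are recorded drum sounds and the operation is a non-trainable convolution layer.

```python
def synthesize(transcription, kernels):
    """Mix the tracks obtained by convolving onsets with fixed waveforms.

    transcription[k][t]: onset amplitude of component k at sample t.
    kernels[k]: fixed waveform (list of samples) of component k.
    """
    n = len(transcription[0])
    mix = [0.0] * n
    for onsets, kernel in zip(transcription, kernels):
        for t, amp in enumerate(onsets):
            if amp == 0.0:
                continue
            for i, w in enumerate(kernel):
                if t + i < n:  # truncate waveforms that run past the end
                    mix[t + i] += amp * w
    return mix


toy_transcription = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # component 0
                     [0.0, 0.0, 0.5, 0.0, 0.0, 0.0]]   # component 1
toy_kernels = [[1.0, 0.5], [0.4, 0.2, 0.1]]            # hypothetical waveforms
mix = synthesize(toy_transcription, toy_kernels)
```

Because the output is a sum of products of activations and constants, it is differentiable with respect to the transcription, which is what allows the reconstruction error to train the transcriber.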
In the implementation, the drum components are those in the Subclass column of Table 2, following [36]. Those marked with asterisks were excluded due to their scarcity in our source of isolated drum recordings, which consisted of 12 virtual drum instruments provided by Logic Pro X. Multiple drum kits, covering rock, pop, funk, and soul (Brooklyn, Heavy, Liverpool, Neo Soul, Detroit Garage, Motown Revisited, Portland, Sunset, Speakeasy, SoCal, Smash, and Slow Jam; all with velocity=98), were used to prevent the network from overfitting to a specific drum kit. During training, a drum kit was randomly assigned for every batch.
3.3 Learning
Since the transcription loss cannot be computed directly during unsupervised learning, we carefully designed a loss function at the audio level such that minimizing it would also minimize the transcription loss. To do so, the audio-level loss should be able to differentiate the drum components (kick drum (KD), snare drum (SD), and hi-hat (HH)) while being invariant to the varying drum kits. Perceptually, there are clear differences between KD, SD, and HH. Although both are impulsive, KD is in the low-frequency band while SD is in the mid-frequency band. SD is also relatively tonal and has a longer envelope. HH is more complicated to describe because it varies with playing technique. For example, closed and pedalled HHs are in the high-frequency band, impulsive, and have relatively low energy, while open HHs are similar except louder, with a longer noisy envelope.
We thus define and use onset spectrum similarity, which is designed to represent the similarity based on the onset part of sounds in the spectrum domain. As illustrated in Figure 2, it is measured by i) applying median-filtering based drum extraction [12], which enhances onsets (with an FFT size of 1024 and a median filter length of 31 on both axes), ii) converting both the input and the reconstructed audio to multi-resolution CQTs (constant-Q transforms), and then iii) calculating the mean absolute difference between them.
Among many spectral magnitude representations, we use the (log-magnitude) CQT since the logarithmic frequency scale is known to match human auditory perception well [23]. We followed the Pseudo-CQT implementation in Librosa (http://librosa.github.io/librosa/), which multiplies linear-to-octave filterbanks to an STFT. As a result, the CQT covered nearly 8 octaves from 32.70 Hz (C1) to 8 kHz (the Nyquist frequency of our experiment) with a resolution of 12 bands per octave. This implementation is differentiable.
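As a rough check of the frequency coverage described above, the sketch below computes the number of octaves between C1 and the Nyquist frequency and counts the bin centers that fit at 12 bins per octave. It takes C1 at its standard equal-tempered frequency of 32.70 Hz and assumes geometrically spaced bin centers starting at C1; the function name is illustrative and this is not the Librosa implementation.

```python
import math


def cqt_coverage(f_min=32.70, f_max=8000.0, bins_per_octave=12):
    """Return (octaves spanned, number of bin centers not above f_max)."""
    octaves = math.log2(f_max / f_min)
    n_bins = int(math.floor(bins_per_octave * octaves)) + 1
    return octaves, n_bins


octaves, n_bins = cqt_coverage()
top_center_hz = 32.70 * 2 ** ((n_bins - 1) / 12)
print(round(octaves, 2), n_bins, round(top_center_hz))
```

The span comes out slightly below 8 octaves, which is why the highest octave is only partially covered when the audio is sampled at 16 kHz.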
Figure 3 shows the effect of onset enhancement. It preserves the characteristics of the drum components in the transient part while removing the after-onset components. This process makes the input and the reconstructed audio more similar, as the non-transient parts vary more among drum kits due to their random and noisy nature. In a preliminary experiment, for example, the network tried to reconstruct all the non-transient components of SD using tom-toms and HHs, resulting in non-sparse transcriptions and severe false-positive detection of onsets.
4 Experiments and Analysis
4.1 Setup
For the training of DrummerNet, we used an in-house dataset of drum stems that were crawled from many websites. The dataset consisted of 3,940 unique tracks averaging 225 seconds each, for a total of 249 hours. Since the dataset was crawled from various websites, some details, such as the distribution of drum components, are hard to identify. The tracks were mostly popular western rock/pop music. Alternatives to this in-house dataset can be found in [7] (3,758 drum sample recordings (× 8 seconds = over 8 hours) or 60,000 synthesized drum loops (× 8 seconds = over 133 hours)) and [39] (4,197 drum tracks (259 hours)). We opted for the in-house dataset because it provided more diversity as it was not synthesized.
Each audio file was resampled to 16 kHz and downmixed to mono. The training batch size was 16, and for each audio file, we randomly selected a 2-second segment. On average, there were 112.5 segments in a track, and therefore training with 443,250 (= 3,940 × 112.5) items would be approximately one epoch. With an Nvidia Tesla P100 and a batch size of 32, it took about 9 hours to train a single epoch. We implemented DrummerNet using PyTorch 1.0 [24] and used Librosa 0.6.3 [22] and Madmom 0.16 [4] for audio processing and peak-picking.
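The epoch size quoted above follows directly from the dataset statistics. The short calculation below reproduces it from the numbers in the text; the variable names are illustrative.

```python
n_tracks = 3940             # unique tracks in the training set
avg_track_seconds = 225     # average track length
segment_seconds = 2         # length of one randomly selected training segment

segments_per_track = avg_track_seconds / segment_seconds
items_per_epoch = n_tracks * segments_per_track
print(segments_per_track, items_per_epoch)
```

Since segments are drawn at random rather than tiled, "one epoch" here means the number of items whose total duration roughly equals that of the dataset.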
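We used a heuristic peak-picking method introduced in [5]. This method selects a peak at a given time if it satisfies three conditions (Eq. (1)): i) the activation at that time is the maximum within the max window, ii) it exceeds the mean within the average window by at least the threshold, and iii) the time elapsed since the last detected peak is longer than the waiting window. We set the max window to 50 ms, the average window to 100 ms, the threshold to 0.2, and the waiting window to 50 ms. We mainly use the F1 score along with Precision and Recall, computed using mir_eval [28]. The tolerance window is 50 ms.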
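A minimal sketch of this kind of peak-picking is given below, using the window lengths and threshold stated above. It assumes that both windows are centered on the candidate frame and that the window lengths are total widths; [5] and the Madmom implementation used in the experiments may place the windows differently, so this should be read as an illustration of the three conditions rather than a reproduction of the evaluated code. The function and parameter names are made up for this example.

```python
def pick_peaks(activation, frame_rate, w_max=0.05, w_avg=0.1,
               threshold=0.2, w_wait=0.05):
    """Return peak times in seconds from a list of activation values."""
    half_max = int(round(w_max * frame_rate / 2))
    half_avg = int(round(w_avg * frame_rate / 2))
    wait_frames = w_wait * frame_rate
    n = len(activation)
    peaks, last = [], None
    for i, v in enumerate(activation):
        # Condition 1: local maximum within the max window.
        lo, hi = max(0, i - half_max), min(n, i + half_max + 1)
        if v < max(activation[lo:hi]):
            continue
        # Condition 2: exceeds the local mean by at least the threshold.
        lo, hi = max(0, i - half_avg), min(n, i + half_avg + 1)
        if v < sum(activation[lo:hi]) / (hi - lo) + threshold:
            continue
        # Condition 3: far enough from the last detected peak.
        if last is not None and i - last <= wait_frames:
            continue
        peaks.append(i)
        last = i
    return [i / frame_rate for i in peaks]


activation = [0.0] * 300
activation[50], activation[70], activation[200] = 0.9, 0.5, 0.8
peak_times = pick_peaks(activation, frame_rate=1000)
```

In the toy signal, the weaker bump at frame 70 is rejected because a larger value lies inside its max window, while the two isolated peaks are kept.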
After training, we test the system on three public datasets: IDMT-SMT-Drums (SMT, 104 drum tracks, total 130 minutes [10]), Medley-DB Drums (MDB, 23 tracks, total 20 minutes [36]), and ENST-drums (ENST, 61 minutes [14], drum-only tracks known as ‘wet-mix’ of the ‘minus-one’ subset). According to [40], a task is DTD (drum transcription of drum-only recordings) if the tracks are drum-only, more precisely KD/SD/HH-only, and the system annotates KD/SD/HH events. This is the case for the SMT dataset. A task in which the system annotates KD/SD/HH but the drum tracks contain more than those three components, e.g., tom-toms and cymbals, is named DTP (drum transcription in the presence of percussion) in [40]. Following this convention, we evaluate DTD with SMT (Section 4.3), and DTP with MDB/ENST. We did not fine-tune for any dataset in any experiment and used the whole datasets for evaluation only.
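The onset-level F1 score with a 50 ms tolerance window used throughout the evaluation can be illustrated with a small matching routine. The sketch below uses a simple greedy one-to-one matching on sorted onset times; mir_eval, which was used for the reported numbers, computes an optimal matching, so results may differ in edge cases. The function name and example onset times are hypothetical.

```python
def onset_f1(reference, estimated, tolerance=0.05):
    """Return (precision, recall, F1) for onset times given in seconds."""
    ref, est = sorted(reference), sorted(estimated)
    i = j = matched = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            matched += 1   # each onset is matched at most once
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1         # unmatched estimate: false positive
        else:
            i += 1         # unmatched reference: false negative
    precision = matched / len(est) if est else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


precision, recall, f1 = onset_f1([0.5, 1.0, 1.5, 2.0], [0.52, 1.2, 1.49])
```

Here two of the three estimated onsets fall within 50 ms of a reference onset, so precision is 2/3 and recall is 2/4.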
4.2 Trend of Performance over Training
We did not employ a stopping strategy but trained the network for about 13 epochs. As illustrated in Figure 4, the overall performance gradually increases as the training proceeds and approaches convergence towards the end of training. This indicates that the proposed loss function is a good proxy for the transcription loss. After the initial phase of training, the performance differences among datasets remain consistent, probably due to the different characteristics of drum tracks in each dataset, as will be discussed in Section 4.4.
4.3 Relative Performance against Baselines
In this experiment, we trained our system on the in-house training set without any annotation and evaluated it on a separate test set (also known as ‘eval-cross (trained on DTP)’, [40]), which is a stronger condition than a usual train/test split scenario in supervised learning (‘eval-subset’, [40]). This setup allows us to measure the generalization capabilities across the datasets. Specifically, our experiment is equivalent to the DTD, ‘eval-cross (trained on DTP)’ experiment in [40] (numbers are omitted in the paper but are available online: https://www.audiolabs-erlangen.de/resources/MIR/2017-DrumTranscription-Survey), which is only available on SMT. Therefore, only the performances on SMT are compared in this section. Overall, the performance of DrummerNet compares favorably with that of recent drum transcription systems. With an average F1 score of 0.869 on SMT, the proposed unsupervised DrummerNet outperformed 9 out of 10 systems. The nine systems include ones with deep neural networks and a supervised approach (ReLUts, RNN, lstmpB, tanhB, and GRUts [37, 34, 33, 38]), as well as ones with NMF and an unsupervised approach (AM1, AM2, PFNMF, and SANMF [10, 41]). It did not outperform NMFD [20], a system based on convolutive NMF.
The comparison between DrummerNet and the NMF/unsupervised learning-based systems [10, 41] implies that the proposed deep neural network structure effectively learns relevant representations. Furthermore, DrummerNet allows constant-time inference, unlike NMF and other factorization-based approaches which require iterative optimization in the test time.
What is more interesting is its generalizability. All the deep learning based systems (RNN, tanhB, ReLUts, lstmpB, and GRUts, the RNN-based systems) present deteriorating performance in the transfer learning scenario (eval-cross) compared to the dataset split scenario (eval-subset); see Figure 10 (b) of [40], and note that most of the reported scores in papers also follow the eval-subset setup. However, less data-driven approaches (SANMF, NMF, PFNMF, AM1, and AM2, the NMF-based systems) present similar or even increased performances in eval-cross. This implies that the distributions within datasets are fairly different and biased to certain types of drum tracks and, therefore, a transcription system trained with those datasets will also be biased accordingly. This limitation may be attributed to the small sizes of those datasets. Theoretically, supervised deep learning systems may generalize better if trained on a very large dataset, which lacks practicality due to the high annotation cost. In contrast, it is relatively easy to unbias DrummerNet. One only needs to control the distribution of drum tracks by their style/genre/sounds without annotating every note.
4.4 Qualitative Analysis
In this section, we will analyze the performance and the behavior of DrummerNet by components, datasets, and metrics, as illustrated in Figure 6. Here, we notice two clear trends. First, across all of the three datasets and the metrics, detecting KD was the easiest, followed by SD and HH (except the precision on SMT). Second, SMT seems to be the easiest, followed by MDB and ENST. What could be the reasons?
The first trend is strongly related to the proposed loss function. KD has the least within-class variability while being the most distinguishable component (the largest mutual-class variability) due to its solitary frequency range. SD and HH share both the mid and high-frequency ranges and their sounds can vary significantly across drum kits, i.e., they have larger within-class variability and smaller mutual-class variability. A common pattern, consequently, is the false positive of HH due to SD and vice versa. This is presented in Figure 7, where SD has many false positives due to HH.
The second trend is caused by the mixed use of probability and onset velocity in DrummerNet. Although the transcription estimate is the estimated amplitude of drum components, the peak-picking method treats it as if it were a probability. This discrepancy becomes problematic when the velocities of drum events in a track vary drastically, as in the case of MDB and ENST. A failure case is demonstrated in Figure 7, where the HH with strong accents on several occasions caused DrummerNet to miss many of the other HH peaks.
4.5 Ablation Study
We conducted an ablation study where the performance of DrummerNet is compared with that of its variants. Figure 8 shows the reported F1 scores averaged over datasets and components. Please refer to the caption in Figure 8 for the definitions of the system names.
Sparsemax (DFL vs. SOFT)
Among all the variants in this experiment, we observe the most dramatic change in performance, mostly negative, when we replaced Sparsemax with Softmax (SOFT). In SOFT, the two Softmax layers were applied in sequence; we also tested applying them in parallel and multiplying their outputs, but the training was unstable. The transcription of SOFT tends to be much noisier with many false positives, as presented in Figure 9. We conclude that the sparsity induced by Sparsemax is a crucial factor behind the success of the proposed unsupervised transcription.
Figure 9 provides a good example of the performance degradation pattern for each component. As in Figure 8, although the scores of all the three components decrease in SOFT, the degradation is not as critical for HH as in the case of KD/SD. This observation reflects the underlying properties of the different components. KD and SD are sparser than HH, and thus may benefit more from the introduction of Sparsemax.
CQT (DFL vs. MEL vs. STFT)
Replacing CQTs with either melspectrograms (MEL) or short-time Fourier transform magnitudes (STFT) results in decreased performance. Unlike CQTs, where different FFT sizes are used for each octave range, melspectrograms are computed based on a single-resolution STFT. This implies that DrummerNet benefits from CQTs, which consider multiple temporal and spectral resolutions.
Comparing MEL and STFT, the mel-frequency compression helps with the detection of KD but not of SD or HH. This is explained by the different frequency band weighting of the STFT and the melspectrogram. Since the mel-frequency scale is linear below 1 kHz and logarithmic above 1 kHz [32], the melspectrogram allocates relatively more bins below 1 kHz. This means that the loss function in MEL is biased towards the low-frequency range, resulting in training that favors KD over the others.
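The bias towards low frequencies can be quantified with a short calculation. The sketch below uses one common piecewise definition of the mel scale (linear below 1 kHz, logarithmic above, as in Slaney's Auditory Toolbox); this specific formula is an assumption, since the text does not state which mel variant was used, and it further assumes bands spaced uniformly on each scale from 0 Hz to the 8 kHz Nyquist frequency.

```python
import math


def hz_to_mel_slaney(f_hz):
    """Piecewise mel scale: linear below 1 kHz, logarithmic above."""
    hz_per_mel = 200.0 / 3.0
    break_hz = 1000.0
    log_step = math.log(6.4) / 27.0
    if f_hz < break_hz:
        return f_hz / hz_per_mel
    return break_hz / hz_per_mel + math.log(f_hz / break_hz) / log_step


def share_below(f_cut, f_max, scale):
    """Fraction of uniformly spaced bands (on `scale`) that lie below f_cut."""
    return scale(f_cut) / scale(f_max)


mel_share = share_below(1000.0, 8000.0, hz_to_mel_slaney)
linear_share = share_below(1000.0, 8000.0, lambda f: f)
print(round(mel_share, 3), round(linear_share, 3))
```

Under these assumptions roughly a third of the mel bands fall below 1 kHz, compared with one eighth of the linearly spaced STFT bins, so a mean absolute difference over mel bands weights the kick drum range more heavily.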
Onset Enhancement (DFL vs. NOE)
The onset enhancement is shown to boost the performance, but not significantly (by 0.017). In the learning curve, we observe that removing the onset enhancement from the loss function results in a large performance degradation during the initial phase of training. This is mainly due to false positives in the non-transient part.
Recurrent layers (DFL vs. CONV)
Overall, replacing the three recurrent layers with three convolutional layers does not make a significant difference (0.011). This may mean that i) a long-term relationship may not provide additional information, probably because the transcription largely depends on local information, and ii) the mutual conditioning in the last recurrent layer is not effective in our experiment. In an informal analysis, we observed that with recurrent layers, the transcription estimate still has some local temporal correlation, e.g., the activations are smeared over time, probably because that is better for reconstructing the input audio.
5 Conclusion
We introduced DrummerNet, a deep neural network that is trained to transcribe drum tracks without a labeled dataset. In the experiment, DrummerNet achieved strong performance compared to existing systems trained with supervised learning, showing its generalizability towards a real-world drum transcription scenario. Our ablation study showed that Sparsemax and CQT played a crucial role in the successful training of DrummerNet.
The experiment also revealed room for further improvements. Considering the discreteness of musical notes, a reinforcement learning approach may be more suitable [35], making the prediction more sparse and replacing the peak-picking with a trainable action. The onset-enhanced audio similarity is a function carefully chosen to approximate the transcription loss when only the input audio and its reconstruction are given. Unfortunately, the approximation is limited because the exact drum sounds in the input are not given, and therefore a perfect reconstruction of (the onsets of) the input audio does not lead to a perfect transcription. An alternative would be to measure similarity in a (perceptual) representation domain instead of the audio domain, for example, by learning a loss using forward-backward consistency (also known as a cyclic loss [17]) or known audio features. Lastly, the current synthesizer module is limited to drums as it does not handle the duration of notes. A trainable synthesizer can be used to expand DrummerNet to other instruments [11, 3], eventually leading to an unsupervised universal transcription system combined with instrument recognition.
6 Acknowledgement
We thank Tristan Jehan and Sebastian Ewert for their valuable comments and discussions. We would also like to express our sincere gratitude to Chih-Wei Wu for sharing his insight with us.
References
- [1] Samer A. Abdallah and Mark D. Plumbley. Unsupervised analysis of polyphonic music by sparse coding. IEEE Transactions on Neural Networks, 17(1):179–196, 2006.
- [2] Taylor Berg-Kirkpatrick, Jacob Andreas, and Dan Klein. Unsupervised transcription of piano music. In Advances in Neural Information Processing Systems, pages 1538–1546, 2014.
- [3] Merlijn Blaauw and Jordi Bonada. A neural parametric singing synthesizer. arXiv preprint arXiv:1704.03809, 2017.
- [4] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. madmom: a new Python Audio and Music Signal Processing Library. In Proceedings of the 24th ACM International Conference on Multimedia, pages 1174–1178, Amsterdam, The Netherlands, 10 2016.
- [5] Sebastian Böck, Florian Krebs, and Markus Schedl. Evaluating the online capabilities of onset detection methods. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 49–54, 2012.
- [6] Sebastian Böck and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 121–124. IEEE, 2012.
- [7] Mark Cartwright and Juan Pablo Bello. Increasing drum transcription vocabulary using data synthesis. In Proc. of the 21st Int. Conference on Digital Audio Effects (DAFx-18), Aveiro, Portugal, 2018.
- [8] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NeurIPS Workshop on Deep Learning, December 2014.
