TL;DR
This paper investigates invariance and data augmentation techniques in supervised music transcription, demonstrating a translation-invariant neural network that achieves state-of-the-art results on human recordings.
Contribution
It introduces a translation-invariant model combining filterbanks and CNNs, leveraging frequency invariance and label-preserving augmentations for improved transcription.
Findings
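Top-performing model in the 2017 MIREX Multiple Fundamental Frequency Estimation evaluation
Fewer model parameters, and less overfitting, through parameter sharing in the log-frequency domain
Training on MusicNet augmented with random label-preserving pitch-shift transformations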
Abstract
This paper explores a variety of models for frame-based music transcription, with an emphasis on the methods needed to reach state-of-the-art on human recordings. The translation-invariant network discussed in this paper, which combines a traditional filterbank with a convolutional neural network, was the top-performing model in the 2017 MIREX Multiple Fundamental Frequency Estimation evaluation. This class of models shares parameters in the log-frequency domain, which exploits the frequency invariance of music to reduce the number of model parameters and avoid overfitting to the training data. All models in this paper were trained with supervision by labeled data from the MusicNet dataset, augmented by random label-preserving pitch-shift transformations.
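The parameter sharing described in the abstract can be illustrated with a toy example. The sketch below is not the paper's architecture: the bin count, kernel size, and input frame are hypothetical values chosen for illustration. It only shows the underlying property, namely that a single kernel slid along a log-frequency axis responds to a transposed input with a correspondingly shifted output, so one small set of weights can serve every pitch.

```python
import numpy as np

def shared_filter_response(log_spectrum, kernel):
    """Slide one kernel along the log-frequency axis (valid cross-correlation)."""
    return np.correlate(log_spectrum, kernel, mode="valid")

rng = np.random.default_rng(0)

num_bins = 96      # hypothetical number of log-frequency bins in one frame
kernel_size = 5    # hypothetical kernel width
kernel = rng.standard_normal(kernel_size)

# Toy log-frequency frame: silence except for one localized pattern.
frame = np.zeros(num_bins)
frame[40:45] = [1.0, 0.5, 0.25, 0.5, 1.0]

# On a log-frequency axis, transposing the pitch is a translation by a
# fixed number of bins, independent of the starting pitch.
shift = 3
shifted_frame = np.roll(frame, shift)

response = shared_filter_response(frame, kernel)
shifted_response = shared_filter_response(shifted_frame, kernel)

# Equivariance: shifting the input shifts the output by the same amount.
is_equivariant = np.allclose(shifted_response[shift:], response[:-shift])

# Parameter comparison against a dense layer with the same input and
# output sizes (illustrative only, not figures from the paper).
shared_params = kernel.size
dense_params = num_bins * response.size

print(is_equivariant, shared_params, dense_params)
```

In this toy setting the shared kernel has 5 weights, while a dense layer mapping the same 96 inputs to the same 92 outputs would need 96 × 92 weights. That gap is the kind of parameter reduction the abstract attributes to frequency invariance.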
