TL;DR
This paper investigates how spectrogram reconstruction loss influences automatic music transcription models, demonstrating that it can improve note-level accuracy and frame-level precision without relying on supervised onset/offset sub-tasks.
Contribution
The study introduces a dual U-net architecture trained with spectrogram reconstruction loss, showing its effectiveness in enhancing transcription accuracy over models without reconstruction.
Findings
Reconstruction loss improves note-level transcription accuracy.
Reconstruction loss boosts frame-level precision above that of state-of-the-art models.
Feature maps exhibit grid-like structures, suggesting that the model learns to count along the time and frequency axes.
Abstract
Most of the state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction and train them with onset and offset labels. These predictions are then concatenated and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with spectrogram reconstruction loss) and explore how far this model can go without introducing supervised sub-tasks. In this paper, we do not aim at achieving state-of-the-art transcription accuracy; instead, we explore the effect that spectrogram reconstruction has on our AMT model. Our proposed model consists of two U-nets: the first U-net transcribes the spectrogram into a posteriorgram, and a second U-net transforms the posteriorgram back into a spectrogram. A…
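The training objective described in the abstract (pitch-label supervision plus a spectrogram reconstruction term) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the choice of binary cross-entropy for the transcription term, mean squared error for the reconstruction term, the `recon_weight` parameter, and the names `transcriber` and `reconstructor`, which stand in for the two U-nets as plain callables.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a 2-D grid (frames x pitches)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def mse(pred, target):
    """Mean squared error over a 2-D grid (frames x frequency bins)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def combined_loss(spec, pitch_labels, transcriber, reconstructor, recon_weight=1.0):
    """Hypothetical objective: transcription loss + weighted reconstruction loss.

    transcriber:   spectrogram -> posteriorgram (stand-in for the first U-net)
    reconstructor: posteriorgram -> spectrogram (stand-in for the second U-net)
    """
    posteriorgram = transcriber(spec)
    reconstruction = reconstructor(posteriorgram)
    transcription_loss = bce(posteriorgram, pitch_labels)  # uses pitch labels only
    reconstruction_loss = mse(reconstruction, spec)        # needs no labels
    total = transcription_loss + recon_weight * reconstruction_loss
    return total, transcription_loss, reconstruction_loss


if __name__ == "__main__":
    # Toy data: 2 frames, 2 bins/pitches; the "models" are fixed stand-ins.
    spec = [[0.5, 0.1], [0.0, 0.7]]
    labels = [[1, 0], [0, 1]]
    toy_transcriber = lambda s: [[0.9, 0.1], [0.2, 0.8]]
    toy_reconstructor = lambda p: [[0.4, 0.1], [0.1, 0.6]]
    print(combined_loss(spec, labels, toy_transcriber, toy_reconstructor))
```

Setting `recon_weight=0.0` in this sketch corresponds to the comparison the summary draws against a model trained without the reconstruction part. Note that only the transcription term consumes labels; the reconstruction term compares the output against the input spectrogram itself.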
Taxonomy
Methods: Convolution · Concatenated Skip Connection · Max Pooling · U-Net
