Melody transcription via generative pre-training
Chris Donahue, John Thickstun, Percy Liang

TL;DR
This paper introduces a melody transcription method that uses representations from Jukebox, a generative model pre-trained on broad music audio, together with a new crowdsourced dataset. The approach improves transcription accuracy and enables transcribing lead sheets directly from audio.
Contribution
It presents a new approach combining generative pre-training and a large crowdsourced dataset to enhance melody transcription across diverse musical styles and instruments.
Findings
20% relative improvement in melody transcription over conventional spectrogram features when using Jukebox representations
77% stronger performance than the strongest available baseline when generative pre-training is combined with the new dataset
Enables direct transcription of lead sheets from audio
Abstract
Despite the central role that melody plays in music perception, it remains an open challenge in music information retrieval to reliably detect the notes of the melody present in an arbitrary music recording. A key challenge in melody transcription is building methods which can handle broad audio containing any number of instrument ensembles and musical styles - existing strategies work well for some melody instruments or styles but not all. To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio, thereby improving performance on melody transcription by 20% relative to conventional spectrogram features. Another obstacle in melody transcription is a lack of training data - we derive a new dataset containing hours of melody transcriptions from crowdsourced annotations of broad music. The combination of…
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
Methods: Dense Connections · Residual Connection · Layer Normalization · Dilated Convolution · Position-Wise Feed-Forward Layer · VQ-VAE · Convolution · Jukebox
