Unaligned Supervision For Automatic Music Transcription in The Wild
Ben Maman, Amit H. Bermano

TL;DR
This paper introduces NoteEM, a fully automated method for training music transcription models using unaligned, in-the-wild recordings, achieving state-of-the-art accuracy across diverse instruments without manual score alignment.
Contribution
NoteEM enables training on unaligned, real-world recordings with minimal human intervention, improving multi-instrument automatic music transcription accuracy and robustness.
Findings
Achieved state-of-the-art note-level accuracy on the MAPS dataset.
Demonstrated strong cross-dataset generalization.
Showed robustness with small, self-collected datasets.
Abstract
Multi-instrument Automatic Music Transcription (AMT), or the decoding of a musical recording into semantic musical content, is one of the holy grails of Music Information Retrieval. Current AMT approaches are restricted to piano and (some) guitar recordings, due to difficult data collection. In order to overcome data collection barriers, previous AMT approaches attempt to employ musical scores in the form of a digitized version of the same song or piece. The scores are typically aligned using audio features and strenuous human intervention to generate training labels. We introduce NoteEM, a method for simultaneously training a transcriber and aligning the scores to their corresponding performances, in a fully-automated process. Using this unaligned supervision scheme, complemented by pseudo-labels and pitch-shift augmentation, our method enables training on in-the-wild recordings with…
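The alignment half of the idea can be illustrated with a small sketch. The abstract does not say which alignment algorithm is used, so this example assumes dynamic time warping (DTW) between a score piano roll and the transcriber's frame-wise pitch probabilities. The function name `dtw_align`, the array layouts, and the cross-entropy cost are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def dtw_align(score_roll, pred_probs):
    """Warp a binary score piano roll (S x P) onto audio frames (T x P).

    Assumption: alignment is done with plain DTW, using the binary
    cross-entropy between score frames and predicted probabilities as cost.
    Returns a (T x P) label array aligned to the audio frames.
    """
    S, T = score_roll.shape[0], pred_probs.shape[0]
    p = np.clip(pred_probs, 1e-7, 1 - 1e-7)
    # cost[s, t]: how badly score frame s explains audio frame t
    cost = -(score_roll @ np.log(p).T + (1 - score_roll) @ np.log(1 - p).T)

    # Accumulated cost with the three standard DTW steps
    acc = np.full((S, T), np.inf)
    acc[0, 0] = cost[0, 0]
    for s in range(S):
        for t in range(T):
            if s == 0 and t == 0:
                continue
            best = min(
                acc[s - 1, t] if s > 0 else np.inf,
                acc[s, t - 1] if t > 0 else np.inf,
                acc[s - 1, t - 1] if s > 0 and t > 0 else np.inf,
            )
            acc[s, t] = cost[s, t] + best

    # Backtrack from the end to recover the warping path
    s, t = S - 1, T - 1
    path = [(s, t)]
    while (s, t) != (0, 0):
        candidates = []
        if s > 0 and t > 0:
            candidates.append((acc[s - 1, t - 1], s - 1, t - 1))
        if s > 0:
            candidates.append((acc[s - 1, t], s - 1, t))
        if t > 0:
            candidates.append((acc[s, t - 1], s, t - 1))
        _, s, t = min(candidates)
        path.append((s, t))

    # Copy score frames onto audio frames; if several score frames map to
    # one audio frame, the earliest score frame is the one kept.
    aligned = np.zeros((T, score_roll.shape[1]), dtype=score_roll.dtype)
    for s, t in path:
        aligned[t] = score_roll[s]
    return aligned
```

In a training loop of the kind the abstract describes, labels produced this way would be used to update the transcriber, whose improved predictions would then be used to re-align the score, and so on. That loop is omitted here.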
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
