Count The Notes: Histogram-Based Supervision for Automatic Music Transcription
Jonathan Yaffe, Ben Maman, Meinard Müller, Amit H. Bermano

TL;DR
CountEM introduces a histogram-based supervision method for automatic music transcription that removes the need for local alignment, reducing annotation effort while maintaining high accuracy across multiple instruments.
Contribution
The paper presents CountEM, an EM-based framework that uses note occurrence histograms as supervision for weakly supervised AMT, eliminating the need for local alignment and reducing computational cost.
Findings
CountEM achieves comparable or better accuracy than existing weakly supervised methods.
The approach reduces annotation effort by relying solely on note counts.
Experiments show improved robustness and scalability across diverse datasets.
Abstract
Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computations and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM…
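To make the idea of histogram supervision concrete, the following is a minimal sketch of what a note-count target for one audio segment might look like and how it could be compared with a model's output. It is not the authors' implementation: the function names, the 128-pitch MIDI range, the use of summed onset probabilities as expected counts, and the L1 mismatch are all assumptions chosen for illustration, and the EM procedure itself is not shown.

```python
import numpy as np

N_PITCHES = 128  # assumed MIDI pitch range for this sketch


def note_histogram(pitches, n_pitches=N_PITCHES):
    """Count how often each pitch occurs in a segment; note timing is ignored."""
    return np.bincount(np.asarray(pitches, dtype=np.int64), minlength=n_pitches)


def expected_counts(onset_probs):
    """Sum per-frame onset probabilities (frames x pitches) into per-pitch counts."""
    return onset_probs.sum(axis=0)


def count_mismatch(onset_probs, target_hist):
    """L1 distance between expected and annotated counts (illustrative only)."""
    return float(np.abs(expected_counts(onset_probs) - target_hist).sum())


# Annotation for one segment: C4 (60) twice and E4 (64) once, in no particular order.
target = note_histogram([60, 64, 60])

# Hypothetical model output over 4 frames: onsets for pitch 60 at frames 0 and 2,
# and for pitch 64 at frame 1.
probs = np.zeros((4, N_PITCHES))
probs[0, 60] = 1.0
probs[2, 60] = 1.0
probs[1, 64] = 1.0

print(target[60], target[64])         # prints: 2 1
print(count_mismatch(probs, target))  # prints: 0.0
```

The point of the sketch is that the target carries no frame positions: any placement of two onsets for pitch 60 and one for pitch 64 within the segment yields the same histogram, which is why no local alignment between audio and score is required to form it.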
Taxonomy
Topics: Music and Audio Processing · Time Series Analysis and Forecasting · Music Technology and Sound Studies
