Count The Notes: Histogram-Based Supervision for Automatic Music Transcription

Jonathan Yaffe; Ben Maman; Meinard Müller; Amit H. Bermano

arXiv:2511.14250·cs.SD·November 19, 2025

Open Access · 2 Models

TL;DR

CountEM introduces a histogram-based supervision method for automatic music transcription that removes the need for local alignment, reducing annotation effort while maintaining high accuracy across multiple instruments.

Contribution

The paper presents CountEM, a novel EM-based framework for weakly supervised AMT that uses note occurrence histograms as supervision, eliminating the need for local alignment and reducing computational cost.

Findings

01

CountEM achieves accuracy comparable to or better than that of existing weakly supervised methods.

02

The approach reduces annotation effort by relying solely on note counts.

03

Experiments show improved robustness and scalability across diverse datasets.

Abstract

Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computations and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM…
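
To make the idea of histogram supervision concrete, the following minimal sketch shows how a note event histogram discards timing and keeps only per-pitch counts. It is an illustration under stated assumptions, not the authors' implementation: the function names `note_histogram` and `histogram_l1_distance`, the `(onset_seconds, midi_pitch)` event format, and the L1 comparison are all hypothetical choices made here, and the EM procedure mentioned in the abstract is not reproduced.

```python
from collections import Counter


def note_histogram(note_events, num_pitches=128):
    """Count how often each MIDI pitch occurs in a segment.

    note_events: iterable of (onset_seconds, midi_pitch) pairs.
    Onset times are deliberately ignored, so the result carries
    no alignment information. (Hypothetical helper, for illustration.)
    """
    counts = Counter(pitch for _, pitch in note_events)
    return [counts.get(pitch, 0) for pitch in range(num_pitches)]


def histogram_l1_distance(target, predicted):
    """Sum of absolute per-pitch count differences.

    An illustrative way to compare two histograms; the paper's
    actual objective may differ.
    """
    return sum(abs(t - p) for t, p in zip(target, predicted))


# Two segments containing the same notes with different timing and order.
segment_a = [(0.10, 60), (0.55, 64), (1.02, 60), (1.40, 67)]
segment_b = [(0.30, 60), (0.31, 60), (2.00, 67), (2.75, 64)]

hist_a = note_histogram(segment_a)
hist_b = note_histogram(segment_b)

# Both segments yield the same histogram, so a histogram-level
# target cannot penalize timing differences between them.
print(hist_a == hist_b)
print(histogram_l1_distance(hist_a, hist_b))
```

Because both segments map to the same histogram, a count-based target of this kind needs no frame-level or DTW-style correspondence between audio and annotation, which is the property the abstract highlights.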

Taxonomy

Topics: Music and Audio Processing · Time Series Analysis and Forecasting · Music Technology and Sound Studies