TL;DR
This paper critically examines the Onsets-and-Frames model for automatic music transcription. It finds that attention beyond a moderate temporal context does not benefit the model, and that rule-based post-processing is largely responsible for its state-of-the-art accuracy.
Contribution
It uses a modified additive attention mechanism to examine the Onsets-and-Frames model, and identifies onset detection and rule-based post-processing as the essential components of state-of-the-art automatic music transcription (AMT) performance.
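For orientation, here is a minimal sketch of standard additive attention restricted to a local temporal window. It is not the paper's modified mechanism, whose details are not given on this page. The function names, the `radius` parameter, and the weight shapes are assumptions made for illustration only.

```python
import math

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(query, keys, W_q, W_k, v):
    # Standard additive scoring: score_i = v . tanh(W_q q + W_k k_i).
    q_proj = matvec(W_q, query)
    scores = []
    for k in keys:
        k_proj = matvec(W_k, k)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(keys[0])
    # Context vector: attention-weighted sum of the keys (keys double as values here).
    context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return context, weights

def local_attention(frames, t, radius, W_q, W_k, v):
    # Attend only to frames within `radius` steps of frame t (hypothetical windowing).
    lo = max(0, t - radius)
    hi = min(len(frames), t + radius + 1)
    return additive_attention(frames[t], frames[lo:hi], W_q, W_k, v)

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = local_attention(frames, t=1, radius=1, W_q=identity, W_k=identity, v=[1.0, 1.0])
```

Varying `radius` is one simple way to probe how much temporal context the attention layer can use, which is the kind of question the paper investigates.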
Findings
Attention beyond a moderate temporal context does not benefit the model.
Rule-based post-processing is largely responsible for the SOTA performance.
Onsets are the most significant attentive feature.
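To make the post-processing finding concrete, here is a hedged sketch of an onset-gated decoding rule of the kind associated with Onsets-and-Frames: a note may start only where the onset detector fires, and it sustains while the frame detector stays active. The exact rules examined in the paper are not given on this page. The function name, the 0.5 threshold, and the handling of edge cases are assumptions.

```python
def decode_notes(onset_probs, frame_probs, threshold=0.5):
    """Return (pitch, start_frame, end_frame) tuples, with end_frame exclusive.

    onset_probs and frame_probs are [n_frames][n_pitches] lists of probabilities.
    """
    n_frames = len(frame_probs)
    n_pitches = len(frame_probs[0]) if n_frames else 0
    notes = []
    for p in range(n_pitches):
        start = None  # start frame of the currently sounding note, if any
        for t in range(n_frames):
            onset_on = onset_probs[t][p] >= threshold
            frame_on = frame_probs[t][p] >= threshold
            # Close the active note as soon as the frame detector goes quiet.
            if start is not None and not frame_on:
                notes.append((p, start, t))
                start = None
            # Assumed rule: a new note may begin only where an onset is detected.
            if start is None and onset_on:
                start = t
        if start is not None:
            # Note still sounding at the end of the clip.
            notes.append((p, start, n_frames))
    return notes

# One pitch, five frames: frame activity at frames 3-4 has no onset, so it is dropped.
onsets = [[0.8], [0.1], [0.1], [0.1], [0.1]]
frames = [[0.9], [0.9], [0.1], [0.9], [0.9]]
notes = decode_notes(onsets, frames)
```

In this toy input the second run of frame activity is discarded because no onset accompanies it, which illustrates how a simple rule can remove spurious frame predictions.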
Abstract
Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on the leverage of deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In this paper, we conduct a comprehensive examination of the Onsets-and-Frames AMT model, and pinpoint the essential components contributing to a strong AMT performance. This is achieved through exploitation of a modified additive attention mechanism. The experimental results suggest that the attention mechanism beyond a moderate temporal context does not benefit the model, and that rule-based post-processing is largely responsible for the SOTA performance. We also demonstrate that the onsets are the most significant attentive feature regardless of…
