Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription
Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet

TL;DR
The paper introduces the Sheet Music Transformer, an end-to-end model that transcribes complex polyphonic scores directly from images, surpassing existing methods in accuracy and scalability.
Contribution
It presents the first Transformer-based end-to-end OMR model capable of handling polyphonic music scores without monophonic simplifications.
Findings
Outperforms state-of-the-art OMR methods on two datasets.
Effectively transcribes complex polyphonic scores.
Demonstrates scalability to intricate musical structures.
Abstract
State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, primarily been carried out using monophonic transcription techniques to handle complex score layouts, such as polyphony, often by resorting to simplifications or specific adaptations. Despite their efficacy, these approaches imply challenges related to scalability and limitations. This paper presents the Sheet Music Transformer, the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. Our model employs a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively. The experimental outcomes not only indicate the competence of…
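The image-to-sequence idea in the abstract can be illustrated with a minimal greedy decoding loop: a decoder repeatedly scores candidate next tokens given the image features and the tokens emitted so far, until it produces an end marker. This is only a sketch under stated assumptions, not the authors' implementation: `greedy_decode`, `toy_scores`, the `<bos>`/`<eos>` markers, and the token names are all hypothetical, and `toy_scores` is a stand-in that replays a fixed sequence in place of a trained Transformer.

```python
from typing import Callable, Dict, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end markers

def greedy_decode(
    image_features: Sequence[float],
    next_token_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 32,
) -> List[str]:
    """Emit tokens one at a time, always taking the highest-scoring candidate."""
    tokens = [BOS]
    for _ in range(max_len):
        # The scorer sees the image features plus everything emitted so far.
        scores = next_token_scores(image_features, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start marker

# Made-up token sequence standing in for a score transcription.
TOY_TARGET = ["clef", "note_C4", "note_E4", EOS]

def toy_scores(image_features: Sequence[float], prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a trained model: ignores the image and replays TOY_TARGET."""
    step = min(len(prefix) - 1, len(TOY_TARGET) - 1)
    return {tok: (1.0 if tok == TOY_TARGET[step] else 0.0) for tok in set(TOY_TARGET)}

if __name__ == "__main__":
    print(greedy_decode([0.0], toy_scores))
```

In a real system the scorer would be the model's decoder attending over encoded image features, and the vocabulary would come from the chosen music encoding format.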
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Softmax · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Residual Connection
