Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training
Daniel Dratschuk, Paul Swoboda

TL;DR
Transcoda introduces a data-centric synthetic training approach with normalization and grammar-based decoding, enabling efficient end-to-end zero-shot optical music recognition that outperforms larger models.
Contribution
The paper presents a synthetic data generation pipeline, a normalization of the **kern music encoding to a unique normal form, and grammar-based decoding, which together improve OMR accuracy without requiring large annotated datasets of real scans.
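To make the idea of grammar-based decoding concrete, here is a minimal sketch of one constrained greedy decoding step. It is not the paper's implementation: the five-token vocabulary `VOCAB`, the function `allowed_next`, and its rules are hypothetical and far simpler than a real **kern grammar. The only point is that tokens the grammar forbids are masked out before the most likely token is chosen.

```python
import math

# Hypothetical toy vocabulary: spine header, two notes, a barline, a terminator.
VOCAB = ["**kern", "4c", "4e", "=", "*-"]


def allowed_next(prefix):
    """Toy grammar (an assumption, not real **kern rules).

    A spine must open with '**kern', may then contain notes and barlines,
    and nothing may follow the terminator '*-'.
    """
    if not prefix:
        return {"**kern"}
    if prefix[-1] == "*-":
        return set()
    return {"4c", "4e", "=", "*-"}


def constrained_greedy_step(logits, prefix):
    """Pick the highest-scoring token among those the grammar permits."""
    allowed = allowed_next(prefix)
    if not allowed:
        raise ValueError("sequence is already terminated")
    # Forbidden tokens get a score of -inf so they can never be selected.
    masked = [
        score if token in allowed else -math.inf
        for token, score in zip(VOCAB, logits)
    ]
    best = max(range(len(VOCAB)), key=masked.__getitem__)
    return VOCAB[best]
```

With an empty prefix, `constrained_greedy_step([0.1, 5.0, 1.0, 0.2, 0.3], [])` returns `"**kern"` even though the unconstrained argmax would be `"4c"`, so the output stays syntactically valid by construction.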
Findings
Outperforms state-of-the-art baselines on a synthetic benchmark, reaching 18.46% OMR-NED
Reduces the error rate on historical Polish scans to 63.97% OMR-NED
Trains a 59M-parameter model in 6 hours on a single GPU
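For intuition about what an edit-distance percentage expresses, the sketch below computes a plain normalized Levenshtein distance over token sequences. This is an assumption made for illustration: the exact definition of OMR-NED is not given in this summary and may differ, for example in how tokens are grouped or weighted. The function names are hypothetical. Lower values mean the prediction is closer to the reference.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer sequence length, in [0, 1]."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

For example, `normalized_edit_distance(["4c", "4d", "4e", "="], ["4c", "4d", "4f", "="])` gives 0.25, since one of four tokens has to be substituted.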
Abstract
Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to…
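The encoding non-uniqueness described above can be illustrated with chords: in **kern, the notes of a chord are written as space-separated subtokens, and listing them in a different order still describes the same chord. The sketch below picks one canonical order (lowest note first). It is only an illustration under stated assumptions: the actual normalization rules used by Transcoda are not given in this summary, the function names are hypothetical, and the parser handles only simple subtokens of the form duration, pitch letters, optional accidentals.

```python
import re

# Diatonic step of each pitch letter within an octave.
_STEP = {"c": 0, "d": 1, "e": 2, "f": 3, "g": 4, "a": 5, "b": 6}

# Simplified note subtoken: duration (with dots), pitch letters, accidentals.
_NOTE = re.compile(r"^(\d+\.*)([a-g]+|[A-G]+)([#\-n]*)$")


def diatonic_height(letters):
    """Map **kern pitch letters to a sortable height.

    'c' is middle C (octave 4), 'cc' one octave higher;
    'C' is octave 3, 'CC' one octave lower.
    """
    n = len(letters)
    octave = 3 + n if letters[0].islower() else 4 - n
    return octave * 7 + _STEP[letters[0].lower()]


def normalize_chord(token):
    """Return the chord token with its notes ordered from low to high.

    Tokens containing anything this toy parser does not understand are
    returned unchanged rather than guessed at.
    """
    parsed = []
    for sub in token.split(" "):
        match = _NOTE.match(sub)
        if match is None:
            return token
        parsed.append((diatonic_height(match.group(2)), sub))
    # sorted() is stable, so notes of equal height keep their input order.
    return " ".join(sub for _, sub in sorted(parsed, key=lambda p: p[0]))
```

Both `normalize_chord("4g 4c 4e")` and `normalize_chord("4e 4g 4c")` return `"4c 4e 4g"`, so a model trained on normalized targets sees a single encoding for that chord instead of several equivalent ones.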
