LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR
Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith

TL;DR
Legato is a large-scale, end-to-end pretrained model for optical music recognition that effectively converts music score images into ABC notation, demonstrating strong generalization and superior performance over previous methods.
Contribution
This paper introduces Legato, the first large-scale pretrained OMR model capable of recognizing full-page scores and generating ABC notation, with extensive experiments showing its effectiveness.
Findings
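Achieves a 68% absolute error reduction on the TEDn metric on the most realistic dataset
Achieves a 47.6% absolute error reduction on the OMR-NED metric on the same dataset
Outperforms the previous state of the art across a range of datasets and metrics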
Abstract
We propose Legato, a new end-to-end model for optical music recognition (OMR), a task of converting music score images to machine-readable documents. Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits the strong ability to generalize across various typeset scores. We conduct comprehensive experiments on a range of datasets and metrics and demonstrate that Legato outperforms the previous state of the art. On our most realistic dataset, we see a 68% and 47.6% absolute error reduction on the standard metrics TEDn and OMR-NED, respectively.
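To make the output format and the notion of an error rate concrete, the following is a minimal Python sketch. It is not the paper's evaluation code and it does not implement TEDn or OMR-NED. It only computes a plain normalized Levenshtein distance over whitespace-separated ABC tokens, as a simplified stand-in. The two ABC snippets and the function names `tokenize_abc`, `edit_distance`, and `normalized_error` are hypothetical, chosen for illustration.

```python
# Illustrative sketch only: a plain normalized edit distance over ABC tokens.
# This is NOT TEDn or OMR-NED; the snippets and helper names are hypothetical.

def tokenize_abc(abc):
    """Split an ABC string on whitespace (a deliberately crude tokenizer)."""
    return abc.split()


def edit_distance(a, b):
    """Levenshtein distance between two token lists (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_error(reference, prediction):
    """Edit distance divided by the longer token count, so it lies in [0, 1]."""
    ref_tokens = tokenize_abc(reference)
    pred_tokens = tokenize_abc(prediction)
    denom = max(len(ref_tokens), len(pred_tokens))
    return edit_distance(ref_tokens, pred_tokens) / denom if denom else 0.0


# A two-bar C major scale in ABC: header fields (X, M, L, K), then the notes.
REFERENCE = "X:1\nM:4/4\nL:1/4\nK:C\nC D E F | G A B c |"
# Hypothetical recognition output with one wrong accidental (B read as B-flat).
PREDICTED = "X:1\nM:4/4\nL:1/4\nK:C\nC D E F | G A _B c |"

if __name__ == "__main__":
    print(normalized_error(REFERENCE, PREDICTED))
```

Here a single misread accidental changes one of 14 tokens. The metrics reported in the paper operate on richer representations of the score, so their values are not comparable to this toy number.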
Peer Reviews
Decision: ICLR 2026 Poster
- The PDMX-Synth dataset curated in this paper addresses the scarcity of real-world, complex music score data for OMR. Its public release would significantly benefit research in this field.
- Experimental results demonstrate that LEGATO significantly outperforms both SMT++ and GPT-5, with strong performance on real-world scenarios (camera versions of OpenScore String Quartets and OpenScore Lieder), highlighting the effectiveness and practicality of the proposed approach.
1. Limited performance analysis of general VLMs.
   1. The paper relies solely on GPT-5 as the representative general-purpose vision-language model for comparison, which somewhat limits the persuasiveness of the evaluation. Including additional state-of-the-art multimodal models, such as those from the Gemini, Claude, and Qwen families, would provide a more comprehensive and robust assessment.
   2. The evaluation of GPT-5 lacks sufficient detail. The paper does not specify the prompts provi
- Clear problem focus & scope. Tackles practical, full-page OMR with multi-system/stave layouts, beyond monophonic or single-system settings.
- Strong empirical gains. Large improvements on TEDn/OMR-NED across several datasets, including an IMSLP sample.
- Standardized evaluation. Using MusicXML as a unifying target for comparison reduces format bias; reporting ABC/kern helps triangulate.
- Scalable training recipe. Pretrained vision encoder + compact decoder yields a seemingly efficient path to high
- Attribution of gains is unclear. It’s hard to disentangle where improvements come from: ABC target format, frozen VLM encoder, data scale (214K), or architecture. A dedicated ablation is missing.
- Evaluation confounds. Conversions among ABC/kern/MusicXML may introduce asymmetric errors; conversion failure rates and their impact on metrics are not reported.
- Metric interpretability. TEDn/OMR-NED reductions are compelling but lack qualitative error analysis or audible case studies linking metric
- OMR is an interesting and important research topic in music information retrieval with strong ongoing research interest.
- The proposed vision encoder and Transformer decoder form a well-designed architecture for OMR.
- The system is trained on a large-scale dataset of 214k images, demonstrating strong generalization ability.
- The procedure for creating the dataset is well-designed and systematic.
- Comprehensive experiments are conducted across a range of datasets, showing state-of-the-a
- The proposed PDMX-Synth dataset is derived from the PDMX dataset; therefore, it is effectively a subset and smaller in scale. Will this affect the performance?
- The model architecture shares similarities with the design of SMT++, indicating limited novelty in architectural design.
- There is limited comparison of different vision encoders and decoder sizes, which leaves the technical part somewhat weak.
- There is a lack of equations, such as autoregressive prediction, and loss functions
