LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR

Guang Yang; Victoria Ebert; Nazif Tamer; Brian Siyuan Zheng; Luiza Pozzobon; Noah A. Smith

arXiv:2506.19065 · cs.CV · October 3, 2025

TL;DR

Legato is a large-scale, end-to-end pretrained model for optical music recognition that effectively converts music score images into ABC notation, demonstrating strong generalization and superior performance over previous methods.

Contribution

This paper introduces Legato, the first large-scale pretrained OMR model capable of recognizing full-page scores and generating ABC notation, with extensive experiments showing its effectiveness.

Findings

1. Achieves a 68% absolute error reduction on the TEDn metric on the most realistic dataset
2. Achieves a 47.6% absolute error reduction on the OMR-NED metric on the same dataset
3. Outperforms the previous state of the art

Abstract

We propose Legato, a new end-to-end model for optical music recognition (OMR), the task of converting music score images to machine-readable documents. Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits a strong ability to generalize across various typeset scores. We conduct comprehensive experiments on a range of datasets and metrics and demonstrate that Legato outperforms the previous state of the art. On our most realistic dataset, we see a 68% and 47.6% absolute error reduction on the standard metrics TEDn and OMR-NED, respectively.
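The abstract reports an absolute error reduction on edit-distance-style metrics. The sketch below illustrates the general idea under stated assumptions: it uses a plain normalized Levenshtein distance over tokens, which is a generic stand-in and not the actual definition of TEDn or OMR-NED (neither is given on this page). The token sequences, the helper names, and the baseline/new error values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_error(pred, ref):
    """Edit distance as a percentage of the longer sequence length.

    Assumption: a generic normalization, not the official OMR-NED formula.
    """
    if not pred and not ref:
        return 0.0
    return 100.0 * levenshtein(pred, ref) / max(len(pred), len(ref))


def absolute_reduction(baseline, new):
    """Difference in percentage points between two error rates."""
    return baseline - new


def relative_reduction(baseline, new):
    """Reduction expressed as a percentage of the baseline error rate."""
    return 100.0 * (baseline - new) / baseline


# Hypothetical ABC-like tokens: the prediction drops one note ("B").
ref = ["C", "D", "E", "F", "|", "G", "A", "B", "c", "|"]
pred = ["C", "D", "E", "F", "|", "G", "A", "c", "|"]
error = normalized_error(pred, ref)

# Made-up error rates (not results from the paper): 50% baseline, 20% new.
abs_red = absolute_reduction(50.0, 20.0)   # percentage points
rel_red = relative_reduction(50.0, 20.0)   # percent of the baseline
```

With these made-up numbers, the same improvement is 30 percentage points in absolute terms and 60% in relative terms, which is why it matters that the abstract specifies absolute reduction.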

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The PDMX-Synth dataset curated in this paper addresses the scarcity of real-world, complex music score data for OMR. Its public release would significantly benefit research in this field.
- Experimental results demonstrate that LEGATO significantly outperforms both SMT++ and GPT-5, with strong performance on real-world scenarios (camera versions of OpenScore String Quartets and OpenScore Lieder), highlighting the effectiveness and practicality of the proposed approach.

Weaknesses

1. Limited performance analysis of general VLMs.
   1. The paper relies solely on GPT-5 as the representative general-purpose vision-language model for comparison, which somewhat limits the persuasiveness of the evaluation. Including additional state-of-the-art multimodal models, such as those from the Gemini, Claude, and Qwen families, would provide a more comprehensive and robust assessment.
   2. The evaluation of GPT-5 lacks sufficient detail. The paper does not specify the prompts provi

Reviewer 02 · Rating 4 · Confidence 2

Strengths

- Clear problem focus & scope. Tackles practical, full-page OMR with multi-system/stave layouts, beyond monophonic or single-system settings.
- Strong empirical gains. Large improvements on TEDn/OMR-NED across several datasets, including an IMSLP sample.
- Standardized evaluation. Using MusicXML as a unifying target for comparison reduces format bias; reporting ABC/kern helps triangulate.
- Scalable training recipe. Pretrained vision encoder + compact decoder yields a seemingly efficient path to high

Weaknesses

- Attribution of gains is unclear. It’s hard to disentangle where improvements come from: ABC target format, frozen VLM encoder, data scale (214K), or architecture. A dedicated ablation is missing.
- Evaluation confounds. Conversions among ABC/kern/MusicXML may introduce asymmetric errors; conversion failure rates and their impact on metrics are not reported.
- Metric interpretability. TEDn/OMR-NED reductions are compelling but lack qualitative error analysis or audible case studies linking metric

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- OMR is an interesting and important research topic in music information retrieval with strong ongoing research interest.
- The proposed vision encoder and Transformer decoder form a well-designed architecture for OMR.
- The system is trained on a large-scale dataset of 214k images, demonstrating strong generalization ability.
- The procedure for creating the dataset is well-designed and systematic.
- Comprehensive experiments are conducted across a range of datasets, showing state-of-the-a

Weaknesses

- The proposed PDMX-Synth dataset is derived from the PDMX dataset; therefore, it is effectively a subset and smaller in scale. Will this affect the performance?
- The model architecture shares similarities with the design of SMT++, indicating limited novelty in architectural design.
- There is limited comparison of different vision encoders and decoder sizes, which leaves the technical part a bit weak.
- There is a lack of equations, such as autoregressive prediction, and loss functions

Code & Models

Repositories

guang-yng/legato
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications · Fuzzy Logic and Control Systems · Rough Sets and Fuzzy Logic

Methods: Approximate Bayesian Computation · Sparse Evolutionary Training