ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
Menghe Ma, Siqing Wei, Yuecheng Xing, Yaheng Wang, Fanhong Meng, Peijun Han, Luu Anh Tuan, Haoran Luo

TL;DR
ONOTE is a comprehensive benchmark for evaluating omnimodal notation processing in music AI. It addresses the fragmentation and notation biases of current evaluation and reveals a gap between perceptual accuracy and musical understanding.
Contribution
The paper introduces ONOTE, a deterministic, multi-format benchmark for rigorous evaluation of omnimodal music notation models, highlighting reasoning gaps in current approaches.
Findings
Current models show a disconnect between perceptual accuracy and music-theoretic understanding.
ONOTE provides a standardized, deterministic evaluation framework that removes subjective scoring bias across diverse notation systems.
Evaluation exposes fundamental reasoning vulnerabilities in leading omnimodal models.
Abstract
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by severe notation biases toward Western staff and the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that utilizes a deterministic pipeline, grounded in canonical pitch projection, to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental…
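The abstract does not spell out how canonical pitch projection works, so the following is only a minimal sketch of the general idea, under the assumption that each notation system is mapped to a shared pitch representation (here, MIDI note numbers) and then compared by exact match rather than by a judging model. Every function name, the choice of MIDI as the canonical space, and the scoring rule are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: names and scoring rule are illustrative assumptions,
# not the ONOTE pipeline.

NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets of scale degrees 1..7


def staff_to_midi(name: str) -> int:
    """Project scientific pitch notation (e.g. 'C4', 'F#5', 'Bb3') to MIDI."""
    letter, rest = name[0].upper(), name[1:]
    accidental = 0
    # Consume any leading sharps (#) or flats (b) before the octave number.
    while rest and rest[0] in "#b":
        accidental += 1 if rest[0] == "#" else -1
        rest = rest[1:]
    octave = int(rest)
    return 12 * (octave + 1) + NOTE_OFFSETS[letter] + accidental


def jianpu_to_midi(degree: int, octave_shift: int, tonic_midi: int) -> int:
    """Project a numbered-notation degree (1..7, major key) to MIDI.

    octave_shift counts octave dots: +1 per dot above, -1 per dot below.
    """
    return tonic_midi + MAJOR_SCALE[degree - 1] + 12 * octave_shift


def exact_match_score(predicted: list[int], reference: list[int]) -> float:
    """Deterministic score: position-wise matches over the longer length."""
    if not predicted and not reference:
        return 1.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / max(len(predicted), len(reference))


# The same C major triad written in two notation systems.
staff_pitches = [staff_to_midi(n) for n in ["C4", "E4", "G4"]]
jianpu_pitches = [jianpu_to_midi(d, 0, tonic_midi=60) for d in [1, 3, 5]]
score = exact_match_score(jianpu_pitches, staff_pitches)
```

Because both inputs land in the same pitch space, the comparison is a plain equality check and returns the same score on every run, which is the property that a deterministic pipeline would need in place of a judging model.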
