NOTA: Multimodal Music Notation Understanding for Visual Large Language Model

Mingni Tang; Jiajia Li; Lu Yang; Zhiqiang Zhang; Jinghao Tian; Zuchao Li; Lefei Zhang; Ping Wang

arXiv:2502.14893·cs.CV·February 24, 2025

TL;DR

This paper introduces NOTA, a large-scale multimodal music notation dataset, and NotaGPT, a visual large language model trained on it to understand music notation across visual and textual modalities.

Contribution

The paper presents the first comprehensive multimodal music notation dataset and a specialized large language model trained on it, enabling improved music notation understanding.

Findings

01

NotaGPT-7B outperforms existing models in music understanding tasks.

02

The NOTA dataset comprises 1,019,237 records drawn from 3 regions of the world, which supports model generalization.

03

The training pipeline effectively aligns visual and textual music representations.

Abstract

Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has focused primarily on unimodal symbolic text sequences. Existing general-domain visual language models still lack the ability to understand music notation. To address this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records from 3 regions of the world and covers 3 tasks. Based on this dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we introduce a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent…
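
To make the "one-dimensional text annotation sequence" concrete, here is a minimal sketch of what a tune in ABC notation looks like and how its header fields can be separated from the music body. The tune, the helper names `split_abc` and `count_bars`, and the simplified parsing rules are illustrative assumptions; they are not taken from the NOTA dataset or the NotaGPT pipeline, and they do not cover the full ABC standard.

```python
# A short tune in ABC notation. Illustrative only, not a NOTA record.
ABC_TUNE = """X:1
T:Example Scale
M:4/4
L:1/4
K:C
C D E F | G A B c |"""


def split_abc(abc_text):
    """Split ABC text into a dict of header fields and a list of body lines."""
    headers = {}
    body = []
    for line in abc_text.strip().splitlines():
        line = line.strip()
        # Header lines look like "K:C": one letter, a colon, then the value.
        # Once the body has started, every later line is treated as music.
        if not body and len(line) >= 2 and line[0].isalpha() and line[1] == ":":
            headers[line[0]] = line[2:].strip()
        else:
            body.append(line)
    return headers, body


def count_bars(body_lines):
    """Count non-empty measures delimited by '|' in the music body."""
    bars = 0
    for line in body_lines:
        bars += sum(1 for chunk in line.split("|") if chunk.strip())
    return bars


headers, body = split_abc(ABC_TUNE)
print(headers["K"], headers["M"], count_bars(body))
```

The header fields (`X` reference number, `T` title, `M` meter, `L` default note length, `K` key) and the note letters in the body are the kind of textual content that a cross-modal alignment phase would need to match against the rendered score image.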
