# Score Images as a Modality: Enhancing Symbolic Music Understanding through Large-Scale Multimodal Pre-Training

**Authors:** Yang Qin, Huiming Xie, Shuxue Ding, Yujie Li, Benying Tan, Mingchuan Ye

PMC · DOI: 10.3390/s24155017 · Sensors (Basel, Switzerland) · 2024-08-02

## TL;DR

This paper introduces a new method for understanding symbolic music by combining music score images with MIDI data using multimodal pre-training.

## Contribution

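The paper contributes the SIM model and two pre-training tasks, masked bar-attribute modeling and score-MIDI matching, that integrate score images with MIDI for symbolic music understanding. It also presents a dataset of matched score images and MIDI representations.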

## Key findings

- The SIM model effectively captures music structures and aligns visual and symbolic representations.
- Experimental validation indicates that the proposed pre-training tasks improve symbolic music understanding.

## Abstract

Symbolic music understanding is a critical challenge in artificial intelligence. While traditional symbolic music representations like MIDI capture essential musical elements, they often lack the nuanced expression in music scores. Leveraging the advancements in multimodal pre-training, particularly in visual-language pre-training, we propose a groundbreaking approach: the Score Images as a Modality (SIM) model. This model integrates music score images alongside MIDI data for enhanced symbolic music understanding. We also introduce novel pre-training tasks, including masked bar-attribute modeling and score-MIDI matching. These tasks enable the SIM model to capture music structures and align visual and symbolic representations effectively. Additionally, we present a meticulously curated dataset of matched score images and MIDI representations optimized for training the SIM model. Through experimental validation, we demonstrate the efficacy of our approach in advancing symbolic music understanding.
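The two pre-training tasks named in the abstract can be illustrated with a minimal data-preparation sketch. This summary does not describe how the authors implement either task, so everything below is an assumption modeled on common masked-modeling and image-text-matching setups: the function names `mask_bar_attributes` and `make_matching_pairs`, the `<mask>` token, the use of time-signature strings as a stand-in "bar attribute", and the random choice of negatives are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def mask_bar_attributes(bar_attrs, mask_prob, rng):
    """Hide some per-bar attribute labels.

    Returns (inputs, targets): masked positions carry MASK in `inputs` and
    the original label in `targets`; unmasked positions have target None,
    meaning no loss would be computed there.
    """
    inputs, targets = [], []
    for attr in bar_attrs:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(attr)
        else:
            inputs.append(attr)
            targets.append(None)
    return inputs, targets


def make_matching_pairs(score_ids, midi_ids, rng):
    """Build (score, midi, label) triples for a binary matching objective.

    Index i in both lists is assumed to refer to the same piece.
    Label 1 marks a true pair; label 0 marks a score paired with the
    MIDI of a different, randomly chosen piece.
    """
    if len(score_ids) != len(midi_ids) or len(score_ids) < 2:
        raise ValueError("need at least two aligned score/MIDI items")
    pairs = []
    n = len(score_ids)
    for i in range(n):
        pairs.append((score_ids[i], midi_ids[i], 1))
        j = rng.choice([k for k in range(n) if k != i])
        pairs.append((score_ids[i], midi_ids[j], 0))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    print(mask_bar_attributes(["4/4", "4/4", "3/4", "3/4"], 0.5, rng))
    print(make_matching_pairs(["s0", "s1", "s2"], ["m0", "m1", "m2"], rng))
```

In a full pipeline, the masked inputs and the labeled pairs would be fed to a model that encodes both the score image and the MIDI sequence; that model is outside the scope of this sketch.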

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11314789/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11314789/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/PMC11314789/full.md

---
Source: https://tomesphere.com/paper/PMC11314789