MuPT: A Generative Symbolic Music Pretrained Transformer
Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos

TL;DR
This paper introduces MuPT, a pretrained transformer for symbolic music generation in ABC notation. It reports improved performance and better coherence across multiple tracks, with models that handle sequences of up to 8192 tokens.
Contribution
The paper presents MuPT, a transformer model trained on ABC notation, and proposes Synchronized Multi-Track ABC Notation (SMT-ABC Notation) to keep measures aligned across tracks during multi-track music generation.
Findings
LLMs are more compatible with ABC notation than MIDI.
MuPT handles up to 8192 tokens, covering 90% of the symbolic music data in its training set.
The study introduces the Symbolic Music Scaling Law (SMS Law).
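The coverage figure in the findings above can be read as the share of pieces that fit inside the context window. The following is a minimal sketch of that arithmetic, assuming coverage is counted per piece rather than per token; the function name `coverage` and the sample lengths are hypothetical and not taken from the paper.

```python
from typing import Sequence


def coverage(token_lengths: Sequence[int], context_length: int = 8192) -> float:
    """Return the fraction of pieces whose token count fits in the context window.

    Assumption: coverage is measured per piece, not per token.
    """
    if not token_lengths:
        return 0.0
    fits = sum(1 for n in token_lengths if n <= context_length)
    return fits / len(token_lengths)


# Made-up token counts for five pieces; only the last one exceeds 8192 tokens.
lengths = [512, 2048, 4096, 8192, 12000]
print(coverage(lengths))  # 0.8
```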
Abstract
In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The…
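The idea behind SMT-ABC Notation described in the abstract, keeping the same measure of every track together so that tracks do not drift out of alignment, can be illustrated with a short sketch. This is an illustration under stated assumptions, not the authors' implementation: the `<|>` group delimiter, the `[V:n]` voice tags, and the helper names `split_bars` and `interleave_tracks` are hypothetical choices, and barlines are assumed to be plain `|` characters only.

```python
from typing import Dict, List


def split_bars(track: str) -> List[str]:
    """Split one track's ABC body into bars.

    Assumption: only the plain '|' barline is used (no '||', '|:' or ':|').
    """
    return [bar.strip() for bar in track.split("|") if bar.strip()]


def interleave_tracks(tracks: Dict[str, str], group_sep: str = "<|>") -> str:
    """Regroup per-track bars so that bar i of every track sits side by side.

    The group delimiter and voice-tag layout are illustrative assumptions.
    """
    bars = {voice: split_bars(body) for voice, body in tracks.items()}
    n_bars = max((len(b) for b in bars.values()), default=0)
    groups = []
    for i in range(n_bars):
        # Collect bar i from each track that is long enough to have one.
        parts = [f"[V:{voice}]{b[i]}|" for voice, b in bars.items() if i < len(b)]
        groups.append(group_sep + "".join(parts) + group_sep)
    return "".join(groups)


# Two toy tracks with two bars each.
tracks = {"1": "C D E F | G A B c |", "2": "C,4 | G,4 |"}
print(interleave_tracks(tracks))
# <|>[V:1]C D E F|[V:2]C,4|<|><|>[V:1]G A B c|[V:2]G,4|<|>
```

With this layout a left-to-right model sees bar 1 of every track before bar 2 of any track, which is the alignment property the abstract attributes to the synchronized notation.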
Peer Reviews
Decision·ICLR 2025 Poster
The study is conducted in a systematic fashion and covers many of the relevant topics that LLM-related literature should address, such as the relationship between dataset size, model parameters, etc., and the quality of generated data. The quality of the generated music shared in the supplementary material is decent.
The authors do not share any samples from the baseline models in the supplementary materials, which would have helped support the results of the human study. The repetition metrics are not clearly motivated. Why would the average repetition rate be a meaningful metric? Isn't the position of the repeats the more important metric? There are some writing issues, e.g., a missing reference in line 353 and a missing word in line 505.
1. Studying a foundation model for symbolic music is both interesting and meaningful work. 2. The team put significant effort into analyzing large datasets and training the foundation model. 3. The findings on scaling laws are interesting.
1. The research contribution is limited. To enhance the novelty of the approach, consider addressing research questions such as: What factors make ABC better than MIDI, and how does the uniqueness of ABC benefit large models in the feature space? Under what circumstances does the model perform better with ABC notation? 2. The comparison between ABC and MIDI is incomplete. The assumption that ABC is better than MIDI requires a detailed and careful analysis. For example, experiments exploring the
The introduction of SMT-ABC Notation represents a novel approach to enhancing the coherence of multi-track symbolic music generation. The use of ABC notation as a foundation for LLMs in music generation, instead of the more commonly used MIDI, adds an original perspective to the field. The concept of the Symbolic Music Scaling Law (SMS Law) is also a significant contribution, offering new insights into the training dynamics of symbolic music models. The paper is well-structured, with clear expla
1. The experimental design is questionable. SMT-ABC is the only methodological innovation in this paper, yet there is insufficient comparative experimentation to explore how SMT-ABC actually improves performance. Models like GPT-4 are not specifically designed for symbolic music generation, and it's predictable that the proposed model would perform better without even conducting experiments. Were the other models in the comparative experiments, such as GPT-4 and MMT, trained or fine-tuned on the
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception
Methods: Approximate Bayesian Computation
