MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

Meng Yang; Jon McCormack; Maria Teresa Llano; Wanchao Su; Chao Lei

arXiv:2601.21740·cs.MM·January 30, 2026

Open Access

TL;DR

MIDI-LLaMA is an instruction-following multimodal large language model built for symbolic music understanding. It outperforms an ABC-notation baseline in music captioning and in semantic alignment for question answering, and human evaluation confirms its advantages in musical comprehension.

Contribution

It introduces the first instruction-following MLLM for symbolic music, combining MusicBERT and Llama-3-8B through a two-stage training pipeline with a new MIDI-text dataset.

Findings

01 Outperforms the ABC-notation baseline in music captioning and question answering

02 Human evaluation shows better music understanding and emotion recognition

03 Extends LLM capabilities to symbolic music understanding

Abstract

Recent advances in multimodal large language models (MLLMs) for audio music have demonstrated strong capabilities in music understanding, yet symbolic music, a fundamental representation of musical structure, remains unexplored. In this work, we introduce MIDI-LLaMA, the first instruction-following MLLM for symbolic music understanding. Our approach aligns the MIDI encoder MusicBERT with Llama-3-8B via a two-stage pipeline comprising feature alignment and instruction tuning. To support training, we design a scalable annotation pipeline that annotates GiantMIDI-Piano with fine-grained metadata, resulting in a MIDI-text dataset. Compared with a baseline trained on MIDI converted into ABC notation under the same instruction-tuning procedure, MIDI-LLaMA performs substantially better in captioning and in semantic alignment for question answering. Human evaluation further confirms the advantages of…
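As a rough illustration of what "feature alignment" between a music encoder and an LLM can look like, the sketch below maps encoder features into the LLM's embedding space and prepends them to the text token embeddings. This is an assumption, not the paper's confirmed design: the abstract does not say what form the alignment module takes, so a single linear projection (a common choice in similar multimodal systems) is used here. The function names, toy dimensions, and random weights are all hypothetical, and plain Python lists stand in for tensors.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised linear layer (weight matrix and bias).

    Hypothetical stand-in for a trainable alignment module.
    """
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(features, weight, bias):
    """Apply the linear layer to every encoder feature vector."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b
         for row, b in zip(weight, bias)]
        for vec in features
    ]

def build_llm_inputs(music_features, text_embeddings, weight, bias):
    """Prepend projected music features to the text token embeddings."""
    return project(music_features, weight, bias) + text_embeddings

# Toy sizes for illustration only; real encoder and LLM widths are far larger.
ENCODER_DIM, LLM_DIM = 4, 6
weight, bias = make_projection(ENCODER_DIM, LLM_DIM)

# Three fake encoder outputs and two fake text token embeddings.
music_features = [[0.1 * (i + j) for j in range(ENCODER_DIM)] for i in range(3)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

inputs = build_llm_inputs(music_features, text_embeddings, weight, bias)
print(len(inputs), len(inputs[0]))  # sequence length, embedding width
```

In a two-stage setup of this kind, the projection would typically be the part trained during feature alignment, with instruction tuning following afterwards; which components are frozen at each stage is not stated in the abstract.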

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception