GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

Zuyao You; Zhesong Yu; Mingyu Liu; Bilei Zhu; Yuan Wan; Zuxuan Wu

arXiv:2605.00371·cs.SD·May 4, 2026

TL;DR

GaMMA is a large multimodal model for comprehensive music understanding. It connects audio encoders to a language model, is trained on large curated datasets, and is evaluated on a new benchmark, MusicBench.

Contribution

This paper introduces GaMMA, a novel multimodal model that unifies temporal and non-temporal music understanding tasks within a single framework, and presents MusicBench, a large-scale music understanding benchmark.

Findings

01. GaMMA achieves state-of-the-art accuracy on multiple music understanding tasks.

02. GaMMA effectively unifies time-series and non-time-series music understanding.

03. MusicBench provides a comprehensive evaluation platform for music LMMs.

Abstract

In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice…
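The abstract says the audio encoders are combined "in a mixture-of-experts manner" but gives no detail on how. As a rough illustration only, the sketch below blends per-frame features from two hypothetical encoders using softmax gate weights. The function names, the fixed gate scores, and the toy feature values are all assumptions made for this example and are not taken from the paper; GaMMA's actual routing may differ substantially.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_features, gate_scores):
    """Blend features from several encoders with softmax gate weights.

    expert_features: one entry per encoder, each a list of frames,
                     each frame a list of floats (same shape for all encoders).
    gate_scores:     one unnormalized score per encoder (hypothetical; a real
                     model would typically predict these from the input).
    Returns (weights, fused_frames).
    """
    weights = softmax(gate_scores)
    n_frames = len(expert_features[0])
    dim = len(expert_features[0][0])
    fused = []
    for t in range(n_frames):
        frame = [
            sum(w * feats[t][d] for w, feats in zip(weights, expert_features))
            for d in range(dim)
        ]
        fused.append(frame)
    return weights, fused


# Toy inputs: two frames of 2-dimensional features from each assumed encoder.
temporal_feats = [[1.0, 2.0], [3.0, 4.0]]  # stand-in for a time-aware encoder
global_feats = [[3.0, 2.0], [1.0, 0.0]]    # stand-in for a clip-level encoder

weights, fused = mix_experts([temporal_feats, global_feats], [0.0, 0.0])
print(weights, fused)
```

With equal gate scores the two encoders contribute equally, so each fused frame is the element-wise average of the two inputs. In a LLaVA-style setup, fused features like these would then be projected into the language model's token embedding space.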
