Rethinking Music Captioning with Music Metadata LLMs

Irmak Bukey; Zhepei Wang; Chris Donahue; Nicholas J. Bryan

arXiv:2602.03023·cs.SD·February 4, 2026

TL;DR

This paper introduces metadata-based music captioning: a model predicts detailed music metadata from audio, and a pre-trained LLM converts that metadata into expressive captions at inference time. The approach is more flexible and needs less training than end-to-end captioning.

Contribution

The paper proposes a metadata prediction and captioning pipeline that reduces training time and lets the caption style be changed after training, while performing comparably to an end-to-end captioning baseline.

Findings

1. Comparable performance to end-to-end models with less training
2. Flexible caption stylization post-training
3. Effective metadata imputation and in-filling capabilities

Abstract

Music captioning, or the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata to generate training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata,…
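The second stage described in the abstract, turning predicted metadata into a caption with a pre-trained LLM at inference time, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the field names (`genre`, `mood`, `tempo`), the prompt wording, and the helpers `build_caption_prompt` and `caption_from_metadata` are hypothetical, and the LLM is represented by any callable that maps a prompt string to text. The point it shows is that the caption style is a prompt parameter, so it can be changed without retraining the metadata predictor.

```python
from typing import Callable, Dict, List, Union

MetadataValue = Union[str, List[str]]


def build_caption_prompt(metadata: Dict[str, MetadataValue], style: str) -> str:
    """Render metadata fields into a text prompt (hypothetical format)."""
    lines = []
    for field, value in metadata.items():
        if not value:  # skip fields that are empty or were not predicted
            continue
        text = ", ".join(value) if isinstance(value, list) else value
        lines.append(f"{field}: {text}")
    return (
        f"Write a {style} caption for a piece of music with this metadata.\n"
        + "\n".join(lines)
    )


def caption_from_metadata(
    metadata: Dict[str, MetadataValue],
    style: str,
    llm: Callable[[str], str],
) -> str:
    """Pass the rendered prompt to any text-in, text-out LLM callable."""
    return llm(build_caption_prompt(metadata, style))


# Hypothetical output of a metadata prediction model for one audio clip.
predicted = {"genre": "jazz", "mood": ["relaxed", "warm"], "tempo": ""}
prompt = build_caption_prompt(predicted, "one-sentence")
print(prompt)
```

Calling `caption_from_metadata(predicted, "poetic", llm)` with a different style string reuses the same metadata, which is the sense in which factual content and natural language style stay separate in this sketch.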

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis