Rethinking Music Captioning with Music Metadata LLMs
Irmak Bukey, Zhepei Wang, Chris Donahue, Nicholas J. Bryan

TL;DR
This paper introduces a metadata-based music captioning approach that predicts detailed music metadata from audio and uses LLMs to generate expressive captions, offering flexibility and efficiency over traditional end-to-end methods.
Contribution
The paper proposes a two-stage pipeline that predicts metadata from audio and then generates captions with a pre-trained LLM. Compared with end-to-end captioning models, it reduces training time and lets caption style be customized after training.
Findings
Comparable performance to end-to-end models with less training
Flexible caption stylization post-training
Effective metadata imputation and in-filling capabilities
Abstract
Music captioning, or the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata to generate training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata,…
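The second stage described in the abstract, converting predicted metadata into a caption with a pre-trained LLM at inference time, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function `build_caption_prompt`, the field names (`genre`, `mood`, `tempo`), and the prompt wording are hypothetical and are not taken from the paper. The sketch only builds the text prompt; the metadata prediction model and the LLM call are out of scope.

```python
from typing import Mapping, Optional


def build_caption_prompt(
    metadata: Mapping[str, Optional[str]],
    style: str = "one short descriptive sentence",
) -> str:
    """Format predicted metadata as a prompt for a pre-trained LLM.

    Hypothetical helper: field names and wording are illustrative only.
    """
    # Keep only fields that the (assumed) metadata model actually filled in.
    known = {field: value for field, value in metadata.items() if value}
    # Sort fields so the prompt is deterministic for a given input.
    facts = [f"- {field}: {value}" for field, value in sorted(known.items())]
    return (
        f"Write a music caption as {style}.\n"
        "Use only the facts listed below; do not invent details.\n"
        + "\n".join(facts)
    )


# Example metadata, as a prediction model might output it (values are made up).
predicted = {"genre": "ambient", "mood": "calm", "tempo": None}
prompt = build_caption_prompt(predicted, style="a playlist blurb")
print(prompt)
```

Because the style is a plain argument to the prompt builder, the same predicted metadata can be rendered in a different caption style without retraining anything, which is the kind of post-training flexibility the summary describes.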
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis
