Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning
Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan

TL;DR
This paper introduces MU-LLaMA, a model that answers music-related questions and generates captions for music files. It targets the shortage of captioned music data that holds back text-to-music generation, introduces the MusicQA dataset to train the model, and reports state-of-the-art results.
Contribution
The paper presents MU-LLaMA, a music understanding model that combines audio features from a pretrained MERT model with the new MusicQA dataset to improve music question answering and captioning.
Findings
MU-LLaMA outperforms existing models in music question answering.
The MusicQA dataset enables effective training for open-ended music questions.
The model achieves high performance in music caption generation.
Abstract
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies
