Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

Shansong Liu; Atin Sakkeer Hussain; Chenshuo Sun; Ying Shan

arXiv:2308.11276 · cs.SD · August 23, 2023 · 2 cites

TL;DR

This paper introduces MU-LLaMA, a model that answers music-related questions and generates captions for music files, addressing the scarcity of captioned music data that limits text-to-music generation. It is trained on the new MusicQA dataset and is reported to achieve state-of-the-art results.

Contribution

The paper presents MU-LLaMA, a music understanding model that uses audio features from a pretrained MERT model and is trained on the new MusicQA dataset to perform music question answering and captioning.

Findings

01. MU-LLaMA outperforms existing models in music question answering.

02. The MusicQA dataset enables effective training for open-ended music questions.

03. The model achieves high performance in music caption generation.

Abstract

Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed…
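
The abstract mentions deriving question-answer pairs from existing audio captioning datasets but, as excerpted here, does not say how. The following is a minimal sketch of the general idea only, assuming a deliberately simple template approach: each caption becomes the answer to a few fixed open-ended questions. The `QAPair` record, the `QUESTION_TEMPLATES` list, the `caption_to_qa_pairs` helper, and the sample clip ID and caption are all hypothetical and are not taken from the paper, whose actual generation method may differ.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One training example: a question about a clip and its answer."""
    audio_id: str
    question: str
    answer: str


# Hypothetical open-ended question templates (illustration only,
# not the questions used to build MusicQA).
QUESTION_TEMPLATES = [
    "Describe the music.",
    "What is this piece of music like?",
]


def caption_to_qa_pairs(audio_id: str, caption: str) -> list[QAPair]:
    """Pair every template question with the caption as its answer."""
    answer = caption.strip()
    if not answer:
        return []  # skip clips that have no usable caption
    return [QAPair(audio_id, question, answer) for question in QUESTION_TEMPLATES]


# Made-up example clip and caption.
pairs = caption_to_qa_pairs("clip_0001", "A slow solo piano piece with a calm mood.")
for pair in pairs:
    print(pair.audio_id, "|", pair.question, "->", pair.answer)
```

A template scheme like this only restates the caption, so it cannot produce questions about specific attributes (instruments, tempo, mood); it is meant to show the shape of the resulting data rather than a method for building a dataset of the depth the abstract calls for.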

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies