MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

Shansong Liu; Atin Sakkeer Hussain; Qilong Wu; Chenshuo Sun; Ying Shan

arXiv:2412.06660·cs.SD·December 10, 2024

TL;DR

MuMu-LLaMA is a novel multi-modal music understanding and generation framework that leverages a new dataset and pre-trained encoders to outperform existing models across various music-related tasks.

Contribution

We introduce a comprehensive multi-modal music dataset and propose MuMu-LLaMA, a model integrating multiple pre-trained encoders for advanced music understanding and generation.

Findings

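01. Outperforms state-of-the-art models on four tasks: music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation

02. Demonstrates effective multi-modal music understanding and generation from text, image, and video inputs

03. Shows potential for diverse multi-modal music applications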

Abstract

Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks (music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation) demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies