MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, Ying Shan

TL;DR
MuMu-LLaMA is a multi-modal music understanding and generation framework that combines a new 167.69-hour multi-modal dataset with pre-trained music, image, and video encoders, and outperforms state-of-the-art models on four music tasks.
Contribution
We introduce a multi-modal music dataset (167.69 hours of data covering text, images, videos, and music annotations) and propose MuMu-LLaMA, a model that integrates pre-trained encoders for music, images, and videos, with AudioLDM 2 and MusicGen for music generation.
Findings
Outperforms state-of-the-art models on four tasks: music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation
Demonstrates effective multi-modal music understanding and generation
Shows potential for diverse multi-modal music applications
Abstract
Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks (music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation) demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies
