M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, Ying Shan

TL;DR
The paper introduces M$^{2}$UGen, a framework that pairs large language models with pretrained encoders to both understand music and generate it from text, image, video, and audio inputs. The authors report results that match or surpass the state of the art.
Contribution
It presents a novel integrated framework for multi-modal music understanding and generation leveraging LLMs and pretrained models, filling a research gap in combined understanding and creation.
Findings
Achieves or surpasses current state-of-the-art performance.
Effectively integrates multi-modal data for music generation.
Supports diverse modalities including text, images, and videos.
Abstract
The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. They also utilize LLMs to understand human intention and generate desired outputs like images, videos, and music. However, research that combines both understanding and generation using LLMs is still limited and in its nascent stage. To address this gap, we introduce a Multi-modal Music Understanding and Generation (M$^{2}$UGen) framework that integrates LLMs' abilities to comprehend and generate music for different modalities. The M$^{2}$UGen framework is purpose-built to unlock creative potential from diverse sources of inspiration, encompassing music, image, and video through the use of pretrained MERT, ViT, and ViViT models, respectively.…
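The modality-to-encoder pairing named in the abstract can be sketched as a simple lookup. This is only an illustrative sketch, not the authors' implementation: the names `ENCODER_FOR_MODALITY` and `select_encoder` are hypothetical, and the encoders appear as plain strings rather than loaded models.

```python
# Pairing stated in the abstract: music -> MERT, image -> ViT, video -> ViViT.
# The dictionary and the helper below are hypothetical names for illustration.
ENCODER_FOR_MODALITY = {
    "music": "MERT",
    "image": "ViT",
    "video": "ViViT",
}


def select_encoder(modality: str) -> str:
    """Return the name of the pretrained encoder paired with a modality."""
    key = modality.strip().lower()
    if key not in ENCODER_FOR_MODALITY:
        supported = ", ".join(sorted(ENCODER_FOR_MODALITY))
        raise ValueError(f"unsupported modality {modality!r}; expected one of: {supported}")
    return ENCODER_FOR_MODALITY[key]


if __name__ == "__main__":
    for name in ("music", "image", "video"):
        print(name, "->", select_encoder(name))
```

In a real system each string would presumably be replaced by a loaded encoder whose features are passed on to the LLM; how that connection is made is not covered by the truncated abstract, so it is left out here.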
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies
