M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

Shansong Liu; Atin Sakkeer Hussain; Qilong Wu; Chenshuo Sun; Ying Shan

arXiv:2311.11255·cs.SD·December 10, 2024·5 cites

TL;DR

The paper introduces M$^{2}$UGen, an LLM-based framework that both understands music and generates it from text, image, video, and music inputs. It relies on pretrained MERT, ViT, and ViViT models for music, images, and video, respectively, and is reported to achieve state-of-the-art results.

Contribution

It presents a novel integrated framework for multi-modal music understanding and generation leveraging LLMs and pretrained models, filling a research gap in combined understanding and creation.

Findings

01

Achieves or surpasses current state-of-the-art performance.

02

Effectively integrates multi-modal data for music generation.

03

Supports diverse modalities including text, images, and videos.

Abstract

The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. They also utilize LLMs to understand human intention and generate desired outputs like images, videos, and music. However, research that combines both understanding and generation using LLMs is still limited and in its nascent stage. To address this gap, we introduce a Multi-modal Music Understanding and Generation (M$^{2}$UGen) framework that integrates LLM's abilities to comprehend and generate music for different modalities. The M$^{2}$UGen framework is purpose-built to unlock creative potential from diverse sources of inspiration, encompassing music, image, and video through the use of pretrained MERT, ViT, and ViViT models, respectively.…
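
The abstract pairs each input modality with a pretrained encoder (music with MERT, image with ViT, video with ViViT). As a rough illustration of that pairing only, the sketch below routes inputs to encoder names. The names `ENCODER_FOR_MODALITY`, `EncodedInput`, and `route_inputs` are hypothetical and not taken from the paper or its repository, and no real model is loaded. The abstract names no encoder for text, so the sketch rejects it rather than guess.

```python
from dataclasses import dataclass

# Pairing stated in the abstract: music -> MERT, image -> ViT, video -> ViViT.
# The dictionary itself is an illustrative assumption, not the authors' code.
ENCODER_FOR_MODALITY = {"music": "MERT", "image": "ViT", "video": "ViViT"}


@dataclass(frozen=True)
class EncodedInput:
    """Hypothetical record tying an input modality to its encoder name."""
    modality: str
    encoder: str


def route_inputs(modalities):
    """Pair each modality with the pretrained encoder named in the abstract."""
    routed = []
    for modality in modalities:
        if modality not in ENCODER_FOR_MODALITY:
            # The abstract lists no encoder for this modality (e.g. text).
            raise ValueError(f"no encoder listed for modality: {modality!r}")
        routed.append(EncodedInput(modality, ENCODER_FOR_MODALITY[modality]))
    return routed
```

Calling `route_inputs(["music", "video"])` yields one `EncodedInput` per modality, in order. How the encoder outputs are then fed to the LLM is not described in the truncated abstract and is left out here.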

Code & Models

Repositories

shansongliu/M2UGen
jax · Official

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies