Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Baisen Wang; Le Zhuo; Zhaokai Wang; Chenxi Bao; Wu Chengjing; Xuecheng Nie; Jiao Dai; Jizhong Han; Yue Liao; Si Liu

arXiv:2412.09428 · cs.CV · December 13, 2024

TL;DR

This paper introduces a novel multimodal music generation framework that uses explicit text and music bridges to improve alignment, controllability, and quality across various input modalities like videos, images, and text.

Contribution

It proposes the Visuals Music Bridge (VMB), a new method combining detailed visual-to-text descriptions and targeted music retrieval to enhance multimodal music generation.

Findings

01. VMB improves music quality and alignment over previous methods.

02. The framework enables effective controllability in music generation.

03. Experiments demonstrate versatility across multiple input modalities.

Abstract

Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Although this approach is effective in other modalities, its application to multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to…
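The data flow described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the `Track` type, the cosine-similarity scoring, the tag-based filtering, and the function names are all hypothetical stand-ins chosen for illustration. It assumes that "broad" retrieval means a nearest-neighbor match against the description, and that "targeted" retrieval means restricting candidates to user-chosen attributes; the source does not specify either.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Track:
    name: str
    embedding: list[float]  # toy stand-in for a learned music embedding (assumption)
    tags: dict[str, str]    # hypothetical attributes, e.g. {"instrument": "piano"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def broad_retrieve(query_emb, library):
    """Assumed 'broad' track: best overall match to the description embedding."""
    return max(library, key=lambda t: cosine(query_emb, t.embedding))


def targeted_retrieve(query_emb, library, controls):
    """Assumed 'targeted' track: keep only tracks matching the user's controls,
    then rank those by similarity. Falls back to the full library if none match."""
    matches = [t for t in library
               if all(t.tags.get(k) == v for k, v in controls.items())]
    return broad_retrieve(query_emb, matches or library)


def build_conditions(description, query_emb, library, controls):
    """Bundle the text bridge and music bridge that would condition a generator."""
    return {
        "text_bridge": description,
        "music_bridge": [
            broad_retrieve(query_emb, library).name,
            targeted_retrieve(query_emb, library, controls).name,
        ],
    }


# Toy library and a hand-written description standing in for the
# output of a visual-to-text description model.
library = [
    Track("calm_piano", [0.9, 0.1], {"tempo": "slow", "instrument": "piano"}),
    Track("fast_drums", [0.1, 0.9], {"tempo": "fast", "instrument": "drums"}),
    Track("slow_strings", [0.7, 0.3], {"tempo": "slow", "instrument": "strings"}),
]
conditions = build_conditions(
    "A quiet sunrise over a lake.", [1.0, 0.0], library, {"instrument": "strings"}
)
print(conditions)
```

In this sketch the user control (`{"instrument": "strings"}`) changes only the targeted track, so the broad track still reflects the overall description while the targeted track reflects the requested attribute.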

Code & Models

Repositories

wbs2788/vmb
Official

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Human Motion and Animation