Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Wu Chengjing, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu

TL;DR
This paper introduces a novel multimodal music generation framework that uses explicit text and music bridges to improve alignment, controllability, and quality across various input modalities like videos, images, and text.
Contribution
It proposes the Visuals Music Bridge (VMB), a new method combining detailed visual-to-text descriptions and targeted music retrieval to enhance multimodal music generation.
Findings
VMB improves music quality and alignment over previous methods.
The framework enables effective controllability in music generation.
Experiments demonstrate versatility across multiple input modalities.
Abstract
Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Although this approach is effective for other modalities, applying it to multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge; a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to…
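The idea of combining a broad and a targeted retrieval track can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how either track works, so here "broad" is taken to mean ranking the whole library by cosine similarity to an embedding of the text description, and "targeted" is taken to mean applying the same ranking after filtering on user-chosen attributes. The `Track` type, the two-dimensional embeddings, the `genre` tag, and the function names are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Track:
    name: str
    embedding: list[float]      # hypothetical embedding of the track
    attributes: dict[str, str]  # hypothetical tags, e.g. {"genre": "ambient"}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def broad_retrieval(query: list[float], library: list[Track], k: int = 1) -> list[Track]:
    """Assumed 'broad' track: rank every track by similarity to the query."""
    ranked = sorted(library, key=lambda t: cosine(query, t.embedding), reverse=True)
    return ranked[:k]


def targeted_retrieval(
    query: list[float], library: list[Track], wanted: dict[str, str], k: int = 1
) -> list[Track]:
    """Assumed 'targeted' track: keep tracks matching the user's attributes, then rank."""
    matches = [
        t for t in library
        if all(t.attributes.get(key) == value for key, value in wanted.items())
    ]
    return broad_retrieval(query, matches, k)


# Toy library; the vectors stand in for embeddings of music descriptions.
LIBRARY = [
    Track("calm_piano", [1.0, 0.0], {"genre": "classical"}),
    Track("slow_synth", [0.9, 0.1], {"genre": "ambient"}),
    Track("fast_drums", [0.0, 1.0], {"genre": "rock"}),
]

query = [1.0, 0.0]  # stand-in for the embedded text description of a visual input
broad = broad_retrieval(query, LIBRARY)
targeted = targeted_retrieval(query, LIBRARY, {"genre": "ambient"})
print(broad[0].name, targeted[0].name)
```

In this sketch the user's attribute choice changes which reference track is returned, which is one plausible way a retrieval step could expose control; how VMB actually conditions generation on the retrieved music is not described in the truncated abstract.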
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Human Motion and Animation
