Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models
Jiajun Li, Tianze Xu, Xuesong Chen, Xinrui Yao, Shuchang Liu

TL;DR
Mozart's Touch is a multi-modal music generation framework that leverages large language models to interpret visual and textual inputs, enabling efficient and transparent creation of emotionally aligned music without extensive model training.
Contribution
The paper introduces Mozart's Touch, a multi-modal music generation framework that uses LLMs to interpret visual and textual inputs. This avoids training or fine-tuning the music generation model and makes the process more efficient and transparent.
Findings
Outperforms state-of-the-art models in evaluations
Effectively interprets cross-modal inputs with LLM-Bridge
Provides a transparent, efficient music generation process
Abstract
In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the creation of music, images, and other artistic forms across a wide range of industries. However, current models for image- and video-to-music synthesis struggle to capture the nuanced emotions and atmosphere conveyed by visual content. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework capable of generating music aligned with cross-modal inputs such as images, videos, and text. The framework consists of three key components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional end-to-end methods, Mozart's Touch uses LLMs to accurately interpret visual elements without requiring the training or fine-tuning of music generation models, providing efficiency and transparency…
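The three-module structure described in the abstract can be pictured as a chain of text-passing stages. The following is a minimal sketch only, not the authors' implementation: every name in it (`Pipeline`, `stub_captioner`, `stub_bridge`, `stub_generator`, the `sunset.jpg` path) is hypothetical, and the stubs stand in for the pre-trained captioning model, the LLM, and the music generation model, none of which are called here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases; these names are illustrative, not from the paper.
Captioner = Callable[[str], str]         # visual input path -> text caption
Bridge = Callable[[str], str]            # caption -> music-oriented prompt
MusicGenerator = Callable[[str], bytes]  # prompt -> audio bytes


@dataclass
class Pipeline:
    """Chains three independent pre-trained stages; nothing is trained."""
    caption: Captioner
    bridge: Bridge
    generate: MusicGenerator

    def run(self, visual_input: str) -> dict:
        caption = self.caption(visual_input)
        prompt = self.bridge(caption)
        audio = self.generate(prompt)
        # The intermediate text is returned so each step can be inspected.
        return {"caption": caption, "prompt": prompt, "audio": audio}


# Stubs so the sketch runs without any model; real stages would wrap
# pre-trained models behind the same call signatures.
def stub_captioner(path: str) -> str:
    return f"a scene described from {path}"


def stub_bridge(caption: str) -> str:
    return f"music that matches the mood of: {caption}"


def stub_generator(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # placeholder for generated audio


pipeline = Pipeline(stub_captioner, stub_bridge, stub_generator)
result = pipeline.run("sunset.jpg")
print(result["prompt"])
```

Because the stages exchange plain text, any one of them could in principle be swapped for a different pre-trained model without retraining the others, and the caption and prompt remain readable at each step. This is one way to read the efficiency and transparency claims in the abstract.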
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Human Motion and Animation
