VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Zeyue Tian; Zhaoyang Liu; Ruibin Yuan; Jiahao Pan; Qifeng Liu; Xu Tan; Qifeng Chen; Wei Xue; Yike Guo

arXiv:2406.04321 · cs.CV · May 8, 2025




TL;DR

VidMuse is a framework for generating high-fidelity music that is semantically aligned with an input video. It combines a dataset of 360K video-music pairs with long-short-term modeling to improve coherence and diversity.

Contribution

We introduce VidMuse, a simple yet effective video-to-music generation framework that leverages a large dataset and long-short-term modeling for improved audio-visual alignment.

Findings

01 Outperforms existing models in audio quality and diversity

02 Produces music that is semantically aligned with the video content

03 Uses a large-scale dataset of 360K video-music pairs

Abstract

In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets are available at https://vidmuse.github.io/.
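The abstract does not say how local and global visual cues are selected, so the following is only a minimal sketch of one plausible reading: short-term context as the consecutive frames in the window currently being scored, and long-term context as frames spread evenly over the whole video. The function name `long_short_term_indices` and all of its parameters are hypothetical and are not taken from the VidMuse code.

```python
def long_short_term_indices(num_frames, window_start, window_len, num_global):
    """Return (local, global) frame indices for one music segment.

    Illustrative assumption only; not the authors' sampling scheme.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")

    # Short-term (local) cue: consecutive frames in the current window,
    # clipped so the window never runs past the end of the video.
    start = max(0, min(window_start, num_frames - 1))
    end = min(num_frames, start + window_len)
    local = list(range(start, end))

    # Long-term (global) cue: frames spaced evenly across the entire video.
    if num_global <= 1:
        global_ = [num_frames // 2]
    else:
        step = (num_frames - 1) / (num_global - 1)
        global_ = [round(i * step) for i in range(num_global)]

    return local, global_


local, global_ = long_short_term_indices(
    num_frames=100, window_start=10, window_len=5, num_global=4
)
print(local)    # [10, 11, 12, 13, 14]
print(global_)  # [0, 33, 66, 99]
```

In a sketch like this, the local indices would feed a fine-grained encoder for the current segment, while the global indices give the generator a coarse view of the full video. How VidMuse actually fuses the two is described in the paper and code, not here.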


Code & Models

Repositories

zeyuet/vidmuse (PyTorch, official)



Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Multimedia Communication and Technology