VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features
Sifei Li, Binxin Yang, Chunji Yin, Chong Sun, Yuxin Zhang, Weiming Dong, Chen Li

TL;DR
VidMusician is a novel framework that generates music aligned with video semantics and rhythm by leveraging hierarchical visual features and a two-stage training process, advancing video-to-music synthesis.
Contribution
It introduces a parameter-efficient hierarchical visual feature integration method for improved semantic-rhythmic video-to-music generation, along with a new diverse dataset DVMSet.
Findings
Outperforms state-of-the-art methods on multiple metrics
Demonstrates robust performance on AI-generated videos
Effectively aligns music with video semantics and rhythm
Abstract
Video-to-music generation presents significant potential in video production, requiring the generated music to be both semantically and rhythmically aligned with the video. Achieving this alignment demands advanced music generation capabilities, sophisticated video understanding, and an efficient mechanism to learn the correspondence between the two modalities. In this paper, we propose VidMusician, a parameter-efficient video-to-music generation framework built upon text-to-music models. VidMusician leverages hierarchical visual features to ensure semantic and rhythmic alignment between video and music. Specifically, our approach utilizes global visual features as semantic conditions and local visual features as rhythmic cues. These features are integrated into the generative backbone via cross-attention and in-attention mechanisms, respectively. Through a two-stage training process,…
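The two conditioning paths named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, dimensions, and weight matrices are all hypothetical. "In-attention" is assumed here to mean projecting a time-aligned condition and adding it to the hidden states before attention, and the local features are assumed to be already resampled to the music sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(hidden, context, w_q, w_k, w_v):
    # hidden: (T, d) music hidden states; context: (N, d_g) global visual features.
    q = hidden @ w_q                          # (T, d)
    k = context @ w_k                         # (N, d)
    v = context @ w_v                         # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, N)
    return softmax(scores) @ v                # (T, d)

def in_attention(hidden, local_feats, w_proj):
    # Assumed form of in-attention: project time-aligned local features
    # and add them element-wise to the hidden states.
    return hidden + local_feats @ w_proj      # (T, d)

def conditioned_layer(hidden, global_feats, local_feats, params):
    # Rhythmic cue first (additive), then semantic condition (cross-attention, residual).
    h = in_attention(hidden, local_feats, params["w_local"])
    return h + cross_attention(h, global_feats,
                               params["w_q"], params["w_k"], params["w_v"])

# Hypothetical sizes: T music steps, N global tokens, feature widths d, d_g, d_l.
rng = np.random.default_rng(0)
T, N, d, d_g, d_l = 8, 4, 16, 12, 10
params = {
    "w_q": rng.normal(size=(d, d)),
    "w_k": rng.normal(size=(d_g, d)),
    "w_v": rng.normal(size=(d_g, d)),
    "w_local": rng.normal(size=(d_l, d)),
}
hidden = rng.normal(size=(T, d))
global_feats = rng.normal(size=(N, d_g))
local_feats = rng.normal(size=(T, d_l))

out = conditioned_layer(hidden, global_feats, local_feats, params)
print(out.shape)  # (8, 16)
```

In this sketch the global features can have any number of tokens, since cross-attention pools over them, while the local features must match the music sequence length step for step, which is what lets them carry timing information.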
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Human Motion and Animation
