VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features

Sifei Li; Binxin Yang; Chunji Yin; Chong Sun; Yuxin Zhang; Weiming Dong; Chen Li

arXiv:2412.06296·cs.SD·December 10, 2024

Open Access

TL;DR

VidMusician generates music aligned with a video's semantics and rhythm by conditioning a text-to-music model on hierarchical visual features and training it in two stages.

Contribution

It introduces a parameter-efficient method for integrating hierarchical visual features to improve semantic and rhythmic alignment in video-to-music generation, along with a new, diverse dataset, DVMSet.

Findings

01 Outperforms state-of-the-art methods on multiple metrics

02 Demonstrates robust performance on AI-generated videos

03 Effectively aligns music with video semantics and rhythm

Abstract

Video-to-music generation presents significant potential in video production, requiring the generated music to be both semantically and rhythmically aligned with the video. Achieving this alignment demands advanced music generation capabilities, sophisticated video understanding, and an efficient mechanism to learn the correspondence between the two modalities. In this paper, we propose VidMusician, a parameter-efficient video-to-music generation framework built upon text-to-music models. VidMusician leverages hierarchical visual features to ensure semantic and rhythmic alignment between video and music. Specifically, our approach utilizes global visual features as semantic conditions and local visual features as rhythmic cues. These features are integrated into the generative backbone via cross-attention and in-attention mechanisms, respectively. Through a two-stage training process,…
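
The two conditioning paths named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that "cross-attention" means music tokens attend over a short sequence of global visual features, and that "in-attention" means a projected, time-aligned local feature is added to the hidden states entering a layer. All names, shapes, and weights (`cross_attention`, `in_attention`, `w_in`, the dimensions) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(hidden, cond, w_q, w_k, w_v):
    """Single-head cross-attention.

    hidden: (T, d) music-token states (queries)
    cond:   (N, d_g) global visual features (keys and values)
    """
    q = hidden @ w_q                          # (T, d)
    k = cond @ w_k                            # (N, d)
    v = cond @ w_v                            # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, N)
    return softmax(scores, axis=-1) @ v       # (T, d)

def in_attention(hidden, local_feats, w_in):
    """Add projected local features to the layer input, step by step.

    Assumes local_feats has already been resampled to the T music steps.
    """
    return hidden + local_feats @ w_in        # (T, d)

# Hypothetical sizes: 8 music steps, model width 16, 4 global tokens.
rng = np.random.default_rng(0)
T, d, N, d_g, d_l = 8, 16, 4, 32, 12
hidden = rng.normal(size=(T, d))
global_feats = rng.normal(size=(N, d_g))      # semantic condition
local_feats = rng.normal(size=(T, d_l))       # rhythmic cue, one row per step
w_q = rng.normal(size=(d, d))
w_k = rng.normal(size=(d_g, d))
w_v = rng.normal(size=(d_g, d))
w_in = rng.normal(size=(d_l, d))

h = in_attention(hidden, local_feats, w_in)               # rhythmic path
out = h + cross_attention(h, global_feats, w_q, w_k, w_v)  # semantic path, residual
print(out.shape)
```

In this sketch the local features act on every time step individually, while the global features are shared across all steps through the attention weights, which is one way to read the semantic versus rhythmic split described above.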

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Human Motion and Animation