GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions
Heda Zuo, Weitao You, Junxian Wu, Shihong Ren, Pei Chen, Mingxu Zhou, Yujia Lu, Lingyun Sun

TL;DR
GVMGen is a versatile hierarchical attention-based model that generates high-quality, diverse music aligned with videos, even in zero-shot scenarios, and is supported by a new large-scale dataset and evaluation metrics.
Contribution
The paper introduces GVMGen, a novel hierarchical attention model for video-to-music generation, along with a large dataset and new metrics for alignment evaluation.
Findings
GVMGen outperforms previous models in music-video correspondence.
The model achieves high diversity and versatility in music generation.
Experimental results validate the effectiveness of hierarchical attentions and the dataset.
Abstract
Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed to generate music that is highly related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing…
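The idea of attending over video features first spatially and then temporally can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a two-stage scheme: a spatial stage that pools each frame's patch features into a single vector, followed by a temporal stage in which every music time step attends over the frame vectors. The names `attend` and `hierarchical_attention`, the single pooling query, and the omission of learned projections are all hypothetical simplifications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted average of the value vectors, one coordinate at a time.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def hierarchical_attention(video, spatial_query, music_queries):
    """video: list of frames, each a list of patch feature vectors.

    Assumed two-stage scheme (illustrative only):
      1. spatial: pool the patches of each frame into one frame vector;
      2. temporal: each music step attends over all frame vectors.
    """
    frame_feats = [attend(spatial_query, patches, patches) for patches in video]
    return [attend(q, frame_feats, frame_feats) for q in music_queries]


# Toy input: 3 frames, 2 patches per frame, feature dimension 2.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
music_queries = [[1.0, 0.0], [0.0, 1.0]]  # one query per music time step
conditioned = hierarchical_attention(video, [0.5, 0.5], music_queries)
print(len(conditioned), len(conditioned[0]))
```

Each output vector is a convex combination of frame vectors, which are themselves convex combinations of patch features, so the sketch keeps only a compact, weighted summary of the video at every music step.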
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies
Methods: ALIGN
