GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions

Heda Zuo; Weitao You; Junxian Wu; Shihong Ren; Pei Chen; Mingxu Zhou; Yujia Lu; Lingyun Sun

arXiv:2501.09972 · cs.SD · April 21, 2025

TL;DR

GVMGen is a versatile hierarchical attention-based model that generates high-quality, diverse music aligned with videos, even in zero-shot scenarios, and is supported by a new large-scale dataset and evaluation metrics.

Contribution

The paper introduces GVMGen, a novel hierarchical attention model for video-to-music generation, along with a large dataset and new metrics for alignment evaluation.

Findings

01

GVMGen outperforms previous models in music-video correspondence.

02

The model achieves high diversity and versatility in music generation.

03

Experimental results validate the effectiveness of hierarchical attentions and the dataset.

Abstract

Composing music for video is essential yet challenging, leading to growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed to generate music that is highly related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing…
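The abstract only names the idea of attending over video features in the spatial and then the temporal dimension; it does not specify the architecture. The sketch below is therefore a generic illustration of two-level attention and not GVMGen's actual design. Every name in it (`attend`, `hierarchical_attention`, `spatial_query`, `music_queries`) is hypothetical, and the single learned spatial query and the plain scaled dot-product form are assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the weighted sum of `values` and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights


def hierarchical_attention(video, spatial_query, music_queries):
    """Two-level attention (assumed layout, not the paper's).

    video: list of frames, each a list of patch feature vectors.
    Level 1 (spatial): pool the patches of each frame into one vector.
    Level 2 (temporal): each music-step query attends over the frame vectors.
    """
    frame_vecs = [attend(spatial_query, patches, patches)[0] for patches in video]
    return [attend(q, frame_vecs, frame_vecs) for q in music_queries]


# Toy input: 2 frames, 2 patches per frame, feature dimension 2.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
conditioned = hierarchical_attention(
    video,
    spatial_query=[0.0, 0.0],
    music_queries=[[1.0, 0.0], [0.0, 1.0]],
)
for vector, weights in conditioned:
    print(vector, weights)
```

The spatial level reduces each frame to one vector before the temporal level runs, which is one simple way to keep pertinent features while cutting redundancy; a real model would use learned projections and multiple heads.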

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies

Methods: ALIGN