Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
Xinyi Tong, Yiran Zhu, Jishang Chen, Chunru Zhan, Tianle Wang, Sirui Zhang, Nian Liu, Tiezheng Ge, Duo Xu, Xin Jin, Feng Yu, Song-Chun Zhu

TL;DR
This paper introduces VeM, a novel video-to-music generation method that achieves semantic, temporal, and rhythmic alignment by leveraging hierarchical video parsing, cross-attention mechanisms, and beat synchronization techniques, supported by a new dataset and metrics.
Contribution
VeM is the first approach to comprehensively align semantic content, timing, and rhythm in video-to-music generation using hierarchical parsing and specialized synchronization modules.
Findings
VeM outperforms existing methods in semantic relevance.
VeM achieves high rhythmic synchronization accuracy.
The new dataset enables stricter evaluation of beat alignment.
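As a rough illustration of what a beat-alignment check can look like, the sketch below scores how many visual beat times have a generated-music beat within a tolerance window. This is a generic beat-hit rate, not the metric defined in the paper; the function name `beat_hit_rate`, the `tolerance` parameter, and the sample timestamps are all assumptions made for illustration.

```python
from bisect import bisect_left


def beat_hit_rate(video_beats, music_beats, tolerance=0.1):
    """Fraction of video beat times (seconds) that have a music beat
    within +/- tolerance seconds. Hypothetical helper, not the paper's metric."""
    if not video_beats:
        return 0.0
    music = sorted(music_beats)
    hits = 0
    for t in video_beats:
        i = bisect_left(music, t)
        # The nearest music beat is either just before t or at/after t.
        neighbours = music[max(i - 1, 0): i + 1]
        if any(abs(t - m) <= tolerance for m in neighbours):
            hits += 1
    return hits / len(video_beats)


# Illustrative timestamps only (seconds), not data from the paper.
video = [0.5, 1.0, 1.5, 2.0]
music = [0.52, 1.3, 1.45, 2.5]
loose = beat_hit_rate(video, music, tolerance=0.1)
strict = beat_hit_rate(video, music, tolerance=0.01)
```

Shrinking `tolerance` makes the check stricter: beats that count as hits under a loose window stop counting once the window narrows.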
Abstract
Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address these challenges, we propose Video Echoed in Music (VeM), a latent music diffusion model that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining…
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
