Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation

Xinyi Tong; Yiran Zhu; Jishang Chen; Chunru Zhan; Tianle Wang; Sirui Zhang; Nian Liu; Tiezheng Ge; Duo Xu; Xin Jin; Feng Yu; Song-Chun Zhu

arXiv:2511.09585 · cs.SD · December 15, 2025

Open Access

TL;DR

This paper introduces VeM, a novel video-to-music generation method that achieves semantic, temporal, and rhythmic alignment by leveraging hierarchical video parsing, cross-attention mechanisms, and beat synchronization techniques, supported by a new dataset and metrics.

Contribution

VeM is the first approach to comprehensively align semantic content, timing, and rhythm in video-to-music generation using hierarchical parsing and specialized synchronization modules.

Findings

01. VeM outperforms existing methods in semantic relevance.

02. VeM achieves high rhythmic synchronization accuracy.

03. The new dataset enables stricter evaluation of beat alignment.

Abstract

Video-to-music generation seeks to produce musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address these challenges, we propose Video Echoed in Music (VeM), a latent music diffusion model that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining…
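The abstract is cut off before it describes SG-CAtt, so its actual design is not shown here. As a rough illustration of the general idea only, the sketch below implements plain scaled dot-product cross-attention, in which music-latent vectors act as queries and video-feature vectors act as keys and values. The names `cross_attention`, `music_latents`, and `video_features` are hypothetical, the learned projection matrices of a real model are omitted, and nothing here should be read as the paper's storyboard-guided variant.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (not the paper's SG-CAtt).

    queries: list of query vectors (assumed here: music latents)
    keys, values: lists of vectors from the other modality (assumed: video)
    Returns the attended outputs and the attention weights per query.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w_j * v[c] for w_j, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data (hypothetical): two music-latent steps, three video segments.
music_latents = [[1.0, 0.0], [0.0, 1.0]]
video_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

attended, attn = cross_attention(music_latents, video_features, video_features)
```

In this toy setup each music-latent step places most of its attention weight on the video segment whose feature vector points in the same direction, which is the basic mechanism by which visual cues can condition the generated music.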

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis