Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation

Shulei Ji; Zihao Wang; Jiaxing Yu; Xiangyuan Yang; Shuyu Li; Songruoyao Wu; Kejun Zhang

arXiv:2511.09090 · cs.SD · November 13, 2025

TL;DR

Diff-V2M introduces a hierarchical diffusion framework for video-to-music generation, explicitly modeling rhythm and integrating visual features to produce more synchronized and contextually coherent music.

Contribution

The paper presents a novel hierarchical conditional diffusion model with explicit rhythmic modeling and advanced feature fusion strategies for improved video-to-music generation.

Findings

01. Low-resolution ODF effectively models musical rhythm.

02. Diff-V2M outperforms existing models on multiple datasets.

03. The hierarchical cross-attention enhances feature integration.

Abstract

Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignments; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are…
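
To make the rhythmic representation mentioned in the abstract more concrete, the following is a minimal sketch of an onset detection function (ODF) reduced to a low temporal resolution. It is not the authors' implementation: the abstract does not specify how the ODF is computed or at what resolution, so this example assumes a simple energy-based ODF (the half-wave rectified increase in short-time energy) and max-pooling for downsampling. The function names `energy_odf`, `downsample_max`, and `click_track`, along with all parameter values, are hypothetical and chosen only for illustration.

```python
import math


def energy_odf(samples, frame_len=512, hop=256):
    """Energy-based ODF (an assumed, simplified variant).

    Returns one value per frame: the half-wave rectified increase in
    short-time energy relative to the previous frame.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    if not energies:
        return []
    odf = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(energies, energies[1:]):
        odf.append(max(0.0, cur - prev))  # keep energy rises only
    return odf


def downsample_max(odf, factor):
    """Lower the temporal resolution by keeping the peak of each block."""
    return [max(odf[i:i + factor]) for i in range(0, len(odf), factor)]


def click_track(sr=8000, seconds=2.0, clicks_at=(0.5, 1.0, 1.5), click_len=0.02):
    """Synthetic test signal: silence with short 1 kHz bursts at given times."""
    out = [0.0] * int(sr * seconds)
    for t in clicks_at:
        start = int(t * sr)
        for i in range(int(click_len * sr)):
            out[start + i] = math.sin(2 * math.pi * 1000 * i / sr)
    return out


if __name__ == "__main__":
    signal = click_track()
    low_res = downsample_max(energy_odf(signal), factor=8)
    # Non-zero entries mark the coarse time blocks that contain an onset.
    print([round(v, 4) for v in low_res])
```

In this toy setting, each low-resolution value covers eight analysis frames, so the curve keeps only where energy rises occur, not their fine timing. Whether the paper's low-resolution ODF is derived in a comparable way cannot be confirmed from the abstract alone.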

Code & Models

Models

Hugging Face model: TaylorJi/Diff-V2M

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception