Controllable Video-to-Music Generation with Multiple Time-Varying Conditions

Junxian Wu; Weitao You; Heda Zuo; Dengming Zhang; Pei Chen; Lingyun Sun

arXiv:2507.20627 · cs.MM · July 29, 2025

TL;DR

This paper introduces a multi-condition guided video-to-music (V2M) generation framework that gives users finer control over music synthesis by combining multiple time-varying conditions with a two-stage training strategy.

Contribution

The paper presents a multi-condition control framework with specialized modules for fine-grained feature selection, progressive temporal alignment, and dynamic conditional fusion, which the authors report improves over existing V2M methods.

Findings

01 Outperforms existing V2M methods in subjective evaluations

02 Achieves better control and alignment with user expectations

03 Demonstrates effective multi-condition integration in music generation

Abstract

Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users' needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a…
