Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
Junxian Wu, Weitao You, Heda Zuo, Dengming Zhang, Pei Chen, Lingyun Sun

TL;DR
This paper introduces a multi-condition guided video-to-music generation framework that gives users finer control over the generated music by combining multiple time-varying conditions with a two-stage training strategy.
Contribution
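The paper presents a multi-condition control framework for video-to-music (V2M) generation built from specialized modules: a fine-grained feature selection module, a progressive temporal alignment attention mechanism, and a dynamic conditional fusion module. The authors report that it improves on existing V2M methods in subjective evaluations.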
Findings
Outperforms existing V2M methods in subjective evaluations
Achieves better control and alignment with user expectations
Demonstrates effective multi-condition integration in music generation
Abstract
Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users' needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a…
