TL;DR
ControlFoley is a unified video-to-audio (V2A) generation framework that improves controllability and cross-modal alignment through a joint visual encoding paradigm and temporal-timbre decoupling, and it introduces a benchmark for textual controllability under visual-text conflict.
Contribution
It proposes a multimodal V2A framework with temporal-timbre decoupling, a robust training scheme, and a benchmark for evaluating textual controllability under visual-text conflict.
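To give a feel for what "temporal-timbre decoupling" could mean in practice, here is a minimal sketch. It is not the paper's method: it assumes, purely for illustration, that reference audio arrives as a sequence of frame-level embeddings and that temporal cues can be suppressed by pooling per-dimension statistics over time. The function name `pool_timbre` and the random toy data are hypothetical.

```python
import math
import random


def pool_timbre(frames):
    """Collapse T frame embeddings (each of length D) into one 2*D vector.

    Illustrative assumption: per-dimension mean and standard deviation over
    time keep a coarse "timbre" summary while discarding frame order, so
    the result carries no information about when events occur.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means + stds


# Toy reference audio: 50 frames of 4-dimensional embeddings (random stand-ins).
random.seed(0)
frames = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]

# Shuffling the frames destroys the temporal structure of the reference.
shuffled = frames[:]
random.shuffle(shuffled)

a = pool_timbre(frames)
b = pool_timbre(shuffled)

# The pooled vector is unchanged (up to floating-point rounding) by reordering.
same = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))
print(len(a), same)
```

The shuffle check shows the property this sketch is after: the pooled representation does not depend on frame order, so any timing in the reference audio cannot leak into the conditioning signal. A learned decoupling module would presumably be more selective than plain statistics pooling.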
Findings
Achieves state-of-the-art performance in multiple V2A tasks.
Demonstrates superior controllability under cross-modal conflict.
Maintains high synchronization and audio quality.
Abstract
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a…
