ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Jianxuan Yang; Xinyue Guo; Zhi Cheng; Kai Wang; Lipan Zhang; Jinjie Hu; Qiang Ji; Yihua Cao; Yihao Meng; Zhaoyue Cui; Mengmei Liu; Meng Meng; Jian Luan

arXiv:2604.15086 · cs.MM · April 17, 2026

TL;DR

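ControlFoley is a unified video-to-audio (V2A) generation framework that accepts video, text, and reference audio as control signals. It combines a joint visual encoding paradigm (CLIP plus a spatio-temporal audio-visual encoder) with temporal-timbre decoupling of the reference audio, and comes with a new benchmark for textual controllability under visual-text conflict.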

Contribution

The paper proposes a multimodal V2A framework with temporal-timbre decoupling, a robust training scheme, and a benchmark for evaluating textual controllability under visual-text conflict.

Findings

01 Achieves state-of-the-art performance in multiple V2A tasks.

02 Demonstrates superior controllability under cross-modal conflict.

03 Maintains high synchronization and audio quality.

Abstract

Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a…
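The abstract does not say how temporal-timbre decoupling is implemented, so the following is only a toy illustration of the general idea, not the authors' method. It assumes, purely for illustration, that reference audio is given as a list of per-frame feature vectors, and it uses order-invariant statistics (per-feature mean and standard deviation over time) as a crude stand-in for "timbre without timing". The function name `timbre_summary` and the feature values are hypothetical.

```python
from statistics import fmean, pstdev


def timbre_summary(frames):
    """Collapse a (time, feature) sequence into per-feature mean and std.

    The statistics ignore frame order, so event timing is discarded while
    the overall feature distribution (a rough timbre proxy) is retained.
    This is an illustrative assumption, not the paper's decoupling module.
    """
    dims = list(zip(*frames))           # transpose: one tuple per feature
    means = [fmean(d) for d in dims]    # average value of each feature
    stds = [pstdev(d) for d in dims]    # spread of each feature over time
    return means + stds


# Toy reference "audio": 6 frames of 3-dimensional features (made-up values).
reference = [
    [0.10, 0.80, 0.30],
    [0.20, 0.70, 0.35],
    [0.90, 0.10, 0.40],
    [0.85, 0.15, 0.30],
    [0.15, 0.75, 0.25],
    [0.12, 0.78, 0.33],
]

# Same frames in reverse order: different timing, same set of sounds.
reordered = reference[::-1]

print(timbre_summary(reference))
print(timbre_summary(reordered))
```

Both calls print the same summary up to floating-point rounding, which is the point of the sketch: a conditioning signal built this way cannot leak the reference clip's event timing into the generated audio, leaving synchronization to be driven by the video. A learned model would presumably use a far richer representation than these two statistics.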

Code & Models

Repositories

https://yjx-research.github.io/ControlFoley