TL;DR
CounterFlow is a two-phase, inference-time sampling method for counterfactual video Foley: it generates audio whose sound source contradicts what the video shows while staying temporally synchronized with the silent video.
Contribution
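A dual-phase sampling scheme for pretrained flow-matching VT2A models that requires no retraining. Phase 1 builds temporal structure from the video while suppressing the visually implied sound source; Phase 2 drops video conditioning and shapes the audio timbre toward the target prompt.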
Findings
Substantially improves counterfactual video Foley generation over naive negative prompting and state-of-the-art baselines.
Proposes a metric for replacement quality that leverages a text-audio co-embedding.
Provides video examples and code.
Abstract
We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present ConterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. ConterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding…
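The dual-phase sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain Euler integrator, a classifier-free-guidance-style combination of two velocity predictions, and a hypothetical `velocity_fn(x, t, video=..., text=...)` interface standing in for a pretrained flow-matching VT2A model. It also assumes Phase 1 suppresses the visually implied source by using a text description of it as the negative branch. The names `dual_phase_sample`, `switch_step`, `implied_text`, and `scale` are illustrative only.

```python
import numpy as np


def guided_velocity(v_pos, v_neg, scale):
    """Guidance-style mix: move toward v_pos and away from v_neg (assumed form)."""
    return v_neg + scale * (v_pos - v_neg)


def dual_phase_sample(velocity_fn, x0, video, target_text, implied_text,
                      num_steps=20, switch_step=10, scale=3.0):
    """Euler-integrate a flow from noise x0 in two phases (illustrative sketch).

    velocity_fn is a hypothetical stand-in for a pretrained VT2A velocity model.
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        if i < switch_step:
            # Phase 1 (assumption): keep video conditioning for timing, and use
            # the visually implied source as the negative prompt.
            v_pos = velocity_fn(x, t, video=video, text=target_text)
            v_neg = velocity_fn(x, t, video=video, text=implied_text)
        else:
            # Phase 2: video conditioning is dropped; only the target prompt
            # steers the sample, against an unconditional negative branch.
            v_pos = velocity_fn(x, t, video=None, text=target_text)
            v_neg = velocity_fn(x, t, video=None, text=None)
        x = x + dt * guided_velocity(v_pos, v_neg, scale)
    return x


def toy_velocity(x, t, video=None, text=None):
    """Toy velocity field for demonstration only; not a real audio model."""
    pull = 1.0 if text == "dog bark" else 0.0
    timing = 0.5 if video is not None else 0.0
    return (pull + timing) - x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = dual_phase_sample(toy_velocity, rng.standard_normal(8),
                               video="frames", target_text="dog bark",
                               implied_text="cat meow")
    print(latent.shape)
```

Where the switch between phases should occur, and how strongly to guide in each phase, are left here as free parameters; the abstract does not state the values the method uses.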
