AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
Pengjun Fang, Yingqing He, Yazhou Xing, Qifeng Chen, Ser-Nam Lim, Harry Yang

TL;DR
AC-Foley introduces an audio-conditioned video-to-audio synthesis model that enables fine-grained, precise sound generation and manipulation by directly leveraging reference audio, overcoming limitations of text-based methods.
Contribution
It proposes a novel audio-conditioned approach for video-to-audio synthesis that improves control, quality, and versatility over existing text-based methods.
Findings
Achieves state-of-the-art Foley generation performance with reference audio.
Remains competitive with video-to-audio methods without audio conditioning.
Enables fine-grained sound synthesis and timbre transfer.
Abstract
Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling…
Peer Reviews
Decision: ICLR 2026 Poster
- The problem setup is novel and interesting. Using reference audio indeed provides more precise control over the generated audio.
- The qualitative results are solid and impressive, and clearly show the advantage of the proposed method.
- The paper is well written.
- The method is simple.
- The method is somewhat unintuitive to me. The conditioning modalities use average pooling, which collapses time, yet the temporal information seems to be well preserved. Can the authors explain why? Have the authors tried temporal representations for the temporal modalities (such as the video)?
- Missing discussion of several related works on V2A (e.g., Rhythmic Foley, SSV2A, AV-LINK).
1. The idea of two-stage training is very interesting and has proven to be effective. AC-Foley separates the learning of audio features and audio-visual alignment into two stages, starting from an easier setting with conditions from the same clip and then extending to a harder case with unseen conditioning. This paradigm is quite inspiring.
2. This paper provides a rigorous evaluation against multiple baselines and different datasets. It not only outperforms in the audio-conditioned case, but
1. The proposed method uses a multi-modal condition embedding in the flow-matching transformer. However, this condition is fused from three different modalities (plus a time embedding), which naturally raises a concern about the quality of the resulting embedding. While the final results show that the model works well, the reviewer would still like to know the intuition behind this design, and what happens if only part of the conditional signals is used.
2. While the human study sho
- The paper unlocks some interesting applications, such as fine-grained sound synthesis (e.g., varying footsteps based on surface material) and timbre transfer (applying one sound's tone to a different visual event). At the same time, the framework remains competitive, matching or closely approaching existing state-of-the-art performance in standard video-to-audio tasks.
- The method supports variable-length audio conditioning and does not require the reference audio and generated audio to have
- Overall, the task itself is not too challenging, so the technical contribution is relatively limited. Most components are reused from previous works, and adding audio control seems straightforward given the available state-of-the-art modules. The method relies on a multimodal conditioning vector that integrates information from video, text, and audio. This vector modulates the transformer input using adaLN layers, which is a common conditioning technique and utilized in oth
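The conditioning scheme the reviewers describe (average-pooled video, text, and audio features plus a time embedding, fused into one vector that modulates the transformer through adaLN) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Fusing by summation, the function names, the tiny dimensions, and the `(1 + scale)` form of the modulation are all assumptions made for illustration.

```python
import math

def mean_pool(seq):
    """Average a sequence of feature vectors over time into one vector.
    The output size does not depend on the sequence length."""
    n = len(seq)
    dim = len(seq[0])
    return [sum(frame[d] for frame in seq) / n for d in range(dim)]

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight, bias):
    """Dense layer; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def build_condition(video_seq, text_seq, audio_seq, time_emb):
    """Hypothetical fusion (an assumption): sum the pooled modality vectors
    and the timestep embedding into a single conditioning vector."""
    parts = [mean_pool(video_seq), mean_pool(text_seq), mean_pool(audio_seq), time_emb]
    return [sum(vals) for vals in zip(*parts)]

def adaln(token, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive layer norm: scale and shift are predicted from the condition."""
    scale = linear(cond, w_scale, b_scale)
    shift = linear(cond, w_shift, b_shift)
    normed = layer_norm(token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]
```

In this sketch the reference audio may have any number of frames, since pooling always yields a fixed-size vector. It also makes the reviewer's question concrete: the pooled vector carries no per-frame timing, so any temporal alignment would have to come from another pathway in the model.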
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music Technology and Sound Studies
