AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Pengjun Fang; Yingqing He; Yazhou Xing; Qifeng Chen; Ser-Nam Lim; Harry Yang

arXiv:2603.15597 · cs.SD · March 23, 2026

TL;DR

AC-Foley introduces an audio-conditioned video-to-audio synthesis model that enables fine-grained, precise sound generation and manipulation by directly leveraging reference audio, overcoming limitations of text-based methods.

Contribution

It proposes a novel audio-conditioned approach for video-to-audio synthesis that improves control, quality, and versatility over existing text-based methods.

Findings

01. Achieves state-of-the-art Foley generation performance with reference audio.

02. Remains competitive with video-to-audio methods without audio conditioning.

03. Enables fine-grained sound synthesis and timbre transfer.

Abstract

Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- The problem setup is novel and interesting. Using reference audio does provide more precise control over the generated audio.
- The qualitative results are solid and impressive, and they show the advantage of the proposed method well.
- The paper is well written.
- The method is simple.

Weaknesses

- The method is a bit less intuitive to me. The conditioning modalities use average pooling, which collapses time, yet the temporal information seems to be well preserved. Can the authors explain the reason for this? Have the authors tried temporal representations for the temporal modalities (such as the video)?
- Discussion of several related V2A works is missing (e.g., Rhythmic Foley, SSV2A, AV-LINK).
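The reviewer's concern about average pooling can be made concrete with a small sketch. It is only an illustration under assumed toy values, not AC-Foley's actual feature pipeline: the function name `average_pool` and the per-frame features are hypothetical. It shows that a mean over the time axis yields the same vector for a clip and its time-reversed copy, so the pooled vector by itself cannot encode the order of events.

```python
def average_pool(frames):
    """Mean over the time axis: a list of T feature vectors -> one vector."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]

# Hypothetical per-frame features (T=3 frames, 2 dims each); values are made up.
clip = [[1.0, 0.0], [2.0, 4.0], [3.0, 8.0]]
reversed_clip = list(reversed(clip))  # same frames, opposite temporal order

pooled = average_pool(clip)
pooled_reversed = average_pool(reversed_clip)

# Both clips collapse to the same vector, so frame order is lost by pooling.
print(pooled == pooled_reversed)
```

If a model conditioned this way still produces well-timed audio, the timing presumably enters through some other pathway, which is the explanation the reviewer asks the authors to provide.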

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The idea of two-stage training is very interesting and has proven to be effective. AC-Foley separates the learning of audio features and audio-visual alignment into two stages. It starts from an easier setting with conditions from the same clip and then extends to a harder case with unseen conditioning. This paradigm is quite inspiring.
2. This paper provides a rigorous evaluation against multiple baselines and different datasets. It not only outperforms in the audio-conditioned case, but

Weaknesses

1. The proposed method uses a multimodal condition embedding in the flow-matching transformer. However, this condition is fused from three different modalities (plus a time embedding), which naturally raises a concern about the quality of the conditional embedding. While the final results show that the model works well, the reviewer would still like to know the intuition behind this design, and what happens if only part of the conditional signals is used.
2. While the human study sho
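The first question above, about using only part of the conditional signals, can be pictured with a toy fusion function. This is a sketch under a loud assumption: the review does not say how the three modality embeddings and the time embedding are combined, so element-wise summation is used here purely for illustration, and `fuse_conditions` and all values are hypothetical.

```python
def fuse_conditions(time_emb, video=None, text=None, audio=None):
    """Sum the time embedding with whichever modality embeddings are given.

    Summation is an assumed fusion rule for illustration only; the real
    model may combine its conditions differently.
    """
    fused = list(time_emb)
    for emb in (video, text, audio):
        if emb is not None:  # a missing modality simply contributes nothing
            fused = [f + e for f, e in zip(fused, emb)]
    return fused

# Hypothetical 2-dimensional embeddings; values are made up.
time_emb = [0.5, 0.5]
video_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]
audio_emb = [2.0, 2.0]

full = fuse_conditions(time_emb, video=video_emb, text=text_emb, audio=audio_emb)
without_audio = fuse_conditions(time_emb, video=video_emb, text=text_emb)

print(full)           # all three modalities plus time
print(without_audio)  # the partial-condition case the reviewer asks about
```

Under this assumed design, dropping a modality changes the single fused vector rather than removing a separate input stream, which is one way to frame the ablation the reviewer requests.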

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper unlocks some interesting applications, such as fine-grained sound synthesis (e.g., varying footsteps based on surface material) and timbre transfer (applying one sound's tone to a different visual event). At the same time, the framework remains competitive, matching or closely approaching existing state-of-the-art performance in standard video-to-audio tasks.
- The method supports variable-length audio conditioning and does not require the reference audio and generated audio to have

Weaknesses

- Overall, the task itself is not too challenging, so the technical contribution is relatively limited. Most components are reused from previous works, and adding audio control seems straightforward with the help of existing state-of-the-art modules. The method relies on a multimodal conditioning vector that integrates information from video, text, and audio. This vector modulates the transformer input using adaLN layers, which is a common conditional technique and utilized in oth
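For readers unfamiliar with the adaLN conditioning this reviewer mentions, a minimal pure-Python sketch of the generic form follows: layer normalization, then a scale and shift predicted from one global conditioning vector, in the `norm(x) * (1 + scale) + shift` style used by DiT-like transformers. It is not taken from AC-Foley; the function names, toy sizes, and weights are all hypothetical.

```python
import math

def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and (near) unit variance."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]

def linear(vec, weight, bias):
    """Plain affine map: weight is a list of rows, one per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def adaln(tokens, cond, w_scale, b_scale, w_shift, b_shift):
    """Modulate every token with a scale and shift predicted from one global vector."""
    scale = linear(cond, w_scale, b_scale)
    shift = linear(cond, w_shift, b_shift)
    return [
        [n * (1.0 + s) + t for n, s, t in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]

# Hypothetical toy sizes: 2 tokens of width 2, conditioning vector of width 2.
tokens = [[1.0, 3.0], [2.0, 6.0]]
cond = [0.5, -0.5]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]

# Zero weights: scale = shift = 0, so adaLN reduces to plain layer norm.
unmodulated = adaln(tokens, cond, zeros, no_bias, zeros, no_bias)
# Identity shift weights: every token is shifted by the same vector, cond.
modulated = adaln(tokens, cond, zeros, no_bias, identity, no_bias)
```

Note that the same scale and shift are applied to every token, so a single pooled conditioning vector acts globally rather than per time step.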

Code & Models

Models

🤗 FF2416/AC-Foley (model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music Technology and Sound Studies