Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

Ciara Rowles; Varun Jampani; Simon Donné; Shimon Vainer; Julian Parker; Zach Evans

arXiv:2510.21581·cs.CV·October 27, 2025

TL;DR

Foley Control introduces a lightweight, modular method for aligning video and audio models by learning a small cross-attention bridge, enabling effective video-guided Foley sound synthesis without retraining large models.

Contribution

The paper aligns a frozen V-JEPA2 video encoder with a frozen Stable Audio Open DiT text-to-audio model by training only a small video cross-attention bridge between them, which keeps the two models modular and the amount of training minimal.

Findings

01. Achieves competitive temporal and semantic alignment on curated video-audio benchmarks.

02. Uses far fewer trainable parameters than recent multi-modal systems.

03. Maintains prompt-driven controllability and modularity.

Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving…
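
To make the bridge idea concrete, the following is a minimal NumPy sketch of the two steps the abstract describes: pooling video tokens, then letting audio hidden states attend to them through a small cross-attention layer placed after text cross-attention. It is not the authors' implementation. The names `pool_video_tokens` and `VideoCrossAttention`, the use of average pooling, the single attention head, the zero-initialised output projection, and all tensor sizes are assumptions made for illustration; random arrays stand in for V-JEPA2 embeddings and for the DiT hidden states.

```python
import numpy as np


def pool_video_tokens(video_tokens, factor):
    """Average-pool along the token axis: (T, D) -> (T // factor, D).

    Average pooling is an assumption; the abstract only says tokens are pooled.
    """
    t, d = video_tokens.shape
    usable = (t // factor) * factor  # drop a ragged tail, if any
    return video_tokens[:usable].reshape(t // factor, factor, d).mean(axis=1)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class VideoCrossAttention:
    """Hypothetical single-head bridge: audio tokens query pooled video tokens.

    In this sketch these four matrices are the only trainable weights;
    everything upstream is treated as frozen.
    """

    def __init__(self, audio_dim, video_dim, attn_dim, rng):
        self.w_q = rng.normal(0.0, 0.02, (audio_dim, attn_dim))
        self.w_k = rng.normal(0.0, 0.02, (video_dim, attn_dim))
        self.w_v = rng.normal(0.0, 0.02, (video_dim, attn_dim))
        # Assumption: zero-initialised output projection, so the bridge starts
        # as an identity mapping and leaves the frozen model's output unchanged.
        self.w_o = np.zeros((attn_dim, audio_dim))

    def __call__(self, audio_tokens, video_tokens):
        q = audio_tokens @ self.w_q
        k = video_tokens @ self.w_k
        v = video_tokens @ self.w_v
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N_audio, N_video)
        return audio_tokens + (weights @ v) @ self.w_o     # residual connection

    def num_params(self):
        return sum(w.size for w in (self.w_q, self.w_k, self.w_v, self.w_o))


rng = np.random.default_rng(0)
audio_hidden = rng.normal(size=(32, 16))  # stand-in: hidden states after text cross-attention
video_embed = rng.normal(size=(24, 8))    # stand-in: video encoder tokens
pooled = pool_video_tokens(video_embed, factor=4)
bridge = VideoCrossAttention(audio_dim=16, video_dim=8, attn_dim=16, rng=rng)
out = bridge(audio_hidden, pooled)
```

Pooling shrinks the key/value sequence (24 tokens to 6 here), which reduces the size of the attention matrix. With the zero-initialised output projection assumed above, `out` equals `audio_hidden` before any training, so the bridge only departs from the frozen model's behaviour as its weights are learned.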
