Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

Shentong Mo; Yibing Song

arXiv:2603.08126·cs.CV·March 10, 2026

TL;DR

FoleyFlow is a video-to-audio generation method that aligns unimodal AV encoders through masked modeling and uses dynamic conditional flows guided by video features, with the aim of improving both the semantic and the rhythmic coherence of the generated audio.

Contribution

The paper presents FoleyFlow, a new approach combining masked AV encoder alignment with dynamic conditional flows for improved video-to-audio generation.

Findings

1. Outperforms existing methods on standard benchmarks.
2. Achieves better semantic and rhythmic coherence in generated audio.
3. Demonstrates effectiveness of masked AV alignment and dynamic flow guidance.

Abstract

Coordinated audio generation from video inputs typically requires strict audio-visual (AV) alignment, where both the semantics and the rhythm of the generated audio segments must correspond to those of the video frames. Previous studies use a two-stage design in which the AV encoders are first aligned via contrastive learning, and the encoded video representations then guide the audio generation process. We observe that both contrastive learning and global video guidance are effective at aligning overall AV semantics but limit temporal rhythmic synchronization. In this work, we propose FoleyFlow, which first aligns unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders, which are separately pretrained using only unimodal data, are aligned with semantic and…
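The masked alignment step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated and gives no architectural details, so the helper names (`mask_segments`, `masked_reconstruction_loss`), the per-segment feature lists, the mask ratio, and the squared-error objective are all assumptions chosen for illustration. A real model would presumably also condition on the unmasked audio context, which is omitted here for brevity.

```python
import random

def mask_segments(num_segments, mask_ratio, seed=0):
    """Hypothetical helper: choose which audio segment indices to hide."""
    rng = random.Random(seed)
    num_masked = max(1, int(round(num_segments * mask_ratio)))
    return sorted(rng.sample(range(num_segments), num_masked))

def masked_reconstruction_loss(audio, video, masked_idx, predict):
    """Mean squared error computed over the masked audio segments only.

    audio, video: time-aligned lists of per-segment feature vectors.
    predict: callable mapping one video segment to a predicted audio segment
             (a stand-in for the encoders plus a prediction head).
    """
    total, count = 0.0, 0
    for t in masked_idx:
        pred = predict(video[t])  # recover audio from the matching video segment
        for p, a in zip(pred, audio[t]):
            total += (p - a) ** 2
            count += 1
    return total / count

# Toy data (assumed): 8 time-aligned segments where audio is exactly 2x video.
video = [[float(t), 1.0] for t in range(8)]
audio = [[2.0 * v for v in seg] for seg in video]

idx = mask_segments(num_segments=8, mask_ratio=0.5)
perfect_loss = masked_reconstruction_loss(audio, video, idx, lambda s: [2.0 * v for v in s])
zero_loss = masked_reconstruction_loss(audio, video, idx, lambda s: [0.0 for _ in s])
```

Because the loss is evaluated only at masked positions and the prediction for segment `t` uses the video segment at the same index `t`, the objective rewards segment-level (temporal) correspondence rather than a single clip-level match, which is the contrast with global contrastive alignment that the abstract draws.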

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing