FolAI: Synchronized Foley Sound Generation with Semantic and Temporal Alignment
Riccardo Fosco Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, Danilo Comminiello

TL;DR
FolAI introduces a two-stage generative framework that automates Foley sound creation by aligning audio with video motion and user-defined semantics, enhancing efficiency and creative control.
Contribution
The paper presents a novel modular approach combining temporal structure estimation and diffusion-based semantic sound generation for synchronized video-audio Foley synthesis.
Findings
Reliable temporal alignment with visual motion
Semantic consistency with user input
Perceptually realistic audio outputs
Abstract
Traditional sound design workflows rely on manual alignment of audio events to visual cues, as in Foley sound design, where everyday actions like footsteps or object interactions are recreated to match the on-screen motion. This process is time-consuming, difficult to scale, and lacks automation tools that preserve creative intent. Despite recent advances in vision-to-audio generation, producing temporally coherent and semantically controllable sound effects from video remains a major challenge. To address these limitations, we introduce FolAI, a two-stage generative framework that decouples the when and the what of sound synthesis, i.e., the temporal structure extraction and the semantically guided generation, respectively. In the first stage, we estimate a smooth control signal from the video that captures the motion intensity and rhythmic structure over time, serving as a temporal…
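As a rough illustration of what a "smooth control signal" capturing motion intensity could look like, the sketch below derives an envelope from frame-to-frame pixel differences and smooths it with a moving average. This is an assumption made for illustration only: the abstract does not specify how FolAI estimates its control signal, and the function name `motion_envelope`, the `win` parameter, and the frame-differencing approach are all hypothetical stand-ins rather than the authors' method.

```python
import numpy as np

def motion_envelope(frames: np.ndarray, win: int = 5) -> np.ndarray:
    """Hypothetical motion-intensity envelope (not the paper's estimator).

    frames: array of shape (T, H, W) holding grayscale video frames.
    Returns a length-T signal scaled to [0, 1].
    """
    frames = frames.astype(np.float64)
    # Mean absolute difference between consecutive frames.
    diff = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    # The first frame has no predecessor, so its motion energy is set to 0.
    energy = np.concatenate([[0.0], diff])
    # Moving-average smoothing; the window is capped at the signal length
    # so that the output keeps exactly T samples.
    win = max(1, min(win, len(energy)))
    kernel = np.ones(win) / win
    smooth = np.convolve(energy, kernel, mode="same")
    # Normalize by the peak; a fully static clip stays all zeros.
    peak = smooth.max()
    return smooth / peak if peak > 0 else smooth

# Usage: a synthetic clip that is static except for one abrupt change.
clip = np.zeros((20, 4, 4))
clip[10:] = 1.0
env = motion_envelope(clip, win=5)
```

In this toy clip the envelope rises only around the frame where the image changes, which is the kind of temporal cue a second, semantically conditioned generation stage could follow.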
Taxonomy
Topics: Music Technology and Sound Studies
Methods: Diffusion · Focus
