FolAI: Synchronized Foley Sound Generation with Semantic and Temporal Alignment

Riccardo Fosco Gramaccioni; Christian Marinoni; Emilian Postolache; Marco Comunità; Luca Cosmo; Joshua D. Reiss; Danilo Comminiello

arXiv:2412.15023 · cs.SD · May 6, 2025

TL;DR

FolAI introduces a two-stage generative framework that automates Foley sound creation by aligning audio with video motion and user-defined semantics, enhancing efficiency and creative control.

Contribution

The paper presents a novel modular approach combining temporal structure estimation and diffusion-based semantic sound generation for synchronized video-audio Foley synthesis.

Findings

01 Reliable temporal alignment with visual motion

02 Semantic consistency with user input

03 Perceptually realistic audio outputs

Abstract

Traditional sound design workflows rely on manual alignment of audio events to visual cues, as in Foley sound design, where everyday actions like footsteps or object interactions are recreated to match the on-screen motion. This process is time-consuming, difficult to scale, and lacks automation tools that preserve creative intent. Despite recent advances in vision-to-audio generation, producing temporally coherent and semantically controllable sound effects from video remains a major challenge. To address these limitations, we introduce FolAI, a two-stage generative framework that decouples the when and the what of sound synthesis, i.e., the temporal structure extraction and the semantically guided generation, respectively. In the first stage, we estimate a smooth control signal from the video that captures the motion intensity and rhythmic structure over time, serving as a temporal…
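To make the idea of a smooth temporal control signal concrete, here is a minimal sketch that derives a motion-intensity envelope from raw video frames. It is not the authors' first-stage model: the function name `motion_envelope`, the frame-differencing heuristic, and the moving-average window are all assumptions chosen purely for illustration.

```python
import numpy as np

def motion_envelope(frames: np.ndarray, window: int = 5) -> np.ndarray:
    """Hypothetical helper: per-frame motion intensity, smoothed and scaled to [0, 1].

    frames: array of shape (T, H, W) holding grayscale frames.
    window: moving-average length in frames (assumed to be <= T).
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Mean absolute difference between consecutive frames as a crude motion proxy.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    # The first frame has no predecessor, so its intensity is set to zero.
    intensity = np.concatenate([[0.0], diffs])
    # Moving-average smoothing yields a slowly varying control signal.
    kernel = np.ones(window) / window
    smooth = np.convolve(intensity, kernel, mode="same")
    peak = smooth.max()
    # Normalize by the peak; a fully static clip stays all zeros.
    return smooth / peak if peak > 0 else smooth

# Toy clip: 20 static frames with one abrupt change at frame 10.
frames = np.zeros((20, 4, 4))
frames[10:] = 1.0
env = motion_envelope(frames)
```

In this toy clip the envelope is non-zero only in a short window around the change at frame 10. In a two-stage design such as the one the abstract describes, a signal of this kind would tell the generator when sound should occur, while the semantic conditioning decides what it sounds like.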

Taxonomy

Topics: Music Technology and Sound Studies

Methods: Diffusion · Focus