SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis
Marco Comunità, Riccardo F. Gramaccioni, Emilian Postolache, Emanuele Rodolà, Danilo Comminiello, Joshua D. Reiss

TL;DR
This paper introduces SyncFusion, a system that automatically extracts action onsets from videos to generate synchronized sound effects using a diffusion model, simplifying sound design and synchronization tasks.
Contribution
The paper presents a novel multimodal system that automates onset detection and sound synthesis, reducing manual effort in video-to-audio synchronization for sound design.
Findings
Successfully generates synchronized sound effects from videos.
Reduces manual effort in editing and synchronizing audio.
Provides open-source code and pretrained models for reproducibility.
Abstract
Sound design involves creatively selecting, recording, and editing sound effects for various media such as cinema, video games, and virtual/augmented reality. One of the most time-consuming steps when designing sound is synchronizing audio with video. In some cases, environmental recordings from video shoots are available, which can aid in the process. However, in video games and animations, no reference audio exists, requiring manual annotation of event timings from the video. We propose a system that extracts the onsets of repetitive actions from a video; these onsets are then used, in conjunction with audio or textual embeddings, to condition a diffusion model trained to generate a new synchronized sound effects audio track. In this way, we leave complete creative control to the sound designer while removing the burden of synchronization with video. Furthermore, editing the onset track or changing the…
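To make the idea of an "onset track" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the onset track is a binary sequence with one value per frame; the function names `onsets_to_track` and `track_to_onsets` and the frame rate are hypothetical choices made for illustration only.

```python
# Illustrative sketch only: the paper's actual onset representation may differ.
# Assumption: an onset track is a binary list with one entry per frame,
# where 1 marks a frame in which an action onset occurs.

def onsets_to_track(onset_times_s, duration_s, frame_rate_hz):
    """Convert onset timestamps (in seconds) into a binary per-frame track."""
    n_frames = int(round(duration_s * frame_rate_hz))
    track = [0] * n_frames
    for t in onset_times_s:
        idx = int(round(t * frame_rate_hz))
        # Ignore onsets that fall outside the clip.
        if 0 <= idx < n_frames:
            track[idx] = 1
    return track


def track_to_onsets(track, frame_rate_hz):
    """Recover onset timestamps (in seconds) from a binary per-frame track."""
    return [i / frame_rate_hz for i, v in enumerate(track) if v]


# Example: two onsets in a 1-second clip sampled at 4 frames per second.
track = onsets_to_track([0.0, 0.5], duration_s=1.0, frame_rate_hz=4)

# Editing the track by hand (here, moving the second onset one frame later)
# changes where a generated sound event would be expected to land.
edited = list(track)
edited[2], edited[3] = 0, 1
```

Under this assumed representation, a sound designer could adjust timing by editing individual frames of the track before it is passed, together with an audio or textual embedding, to the conditioned generator.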
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization
Methods: Diffusion
