Video-Guided Foley Sound Generation with Multimodal Controls

Ziyang Chen; Prem Seetharaman; Bryan Russell; Oriol Nieto; David Bourgin; Andrew Owens; Justin Salamon

arXiv:2411.17698 · cs.CV · March 18, 2025

TL;DR

MultiFoley is a novel multimodal model that generates high-quality, synchronized sound effects for videos based on text, audio, and video inputs, supporting artistic and flexible sound design.

Contribution

The paper introduces joint training on internet videos with low-quality audio and professional sound effects recordings, enabling versatile, high-quality sound generation with multimodal controls.

Findings

01. Outperforms existing methods in automated and human evaluations

02. Generates synchronized, high-quality sounds at 48 kHz

03. Supports diverse conditional inputs including text, audio, and video

Abstract

Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation.…

Code & Models

Datasets

czyang/MultiFoley-VGGSound-Test-Audio
dataset · 49 downloads

Taxonomy

Topics: Speech and Audio Processing