Video-Guided Foley Sound Generation with Multimodal Controls
Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David, Bourgin, Andrew Owens, Justin Salamon

TL;DR
MultiFoley is a novel multimodal model that generates high-quality, synchronized sound effects for videos based on text, audio, and video inputs, supporting artistic and flexible sound design.
Contribution
It introduces a joint training approach on internet videos and professional sound effects, enabling versatile, high-quality sound generation with multimodal controls.
Findings
Outperforms existing methods in automated and human evaluations
Generates synchronized, high-quality sounds at 48kHz
Supports diverse conditional inputs including text, audio, and video
Abstract
Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing
