TL;DR
FoleyDesigner is a framework that automates immersive stereo Foley sound creation for film, combining film clip analysis, diffusion-based Foley generation, and professional audio mixing to improve spatio-temporal alignment and user control.
Contribution
It introduces a novel multi-agent system with diffusion models and LLM-driven mechanisms, along with the FilmStereo dataset, to improve spatio-temporal Foley generation and integration.
Findings
Achieves superior spatio-temporal alignment compared to baselines.
Supports professional audio standards like Dolby Atmos and ITU-R BS.775.
Provides interactive control and seamless pipeline integration.
Abstract
Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in the film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise…
