Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7
Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang

TL;DR
This paper introduces a diffusion-based Foley sound generation system for the DCASE 2023 challenge, leveraging transfer learning, language-audio embeddings, and filtering strategies to improve sound synthesis quality.
Contribution
The system combines AudioLDM with CLAP embeddings and filtering to enhance Foley sound generation, achieving significant performance improvements over baseline methods.
Findings
Achieved an average Fréchet Audio Distance (FAD) of 4.765, compared with 9.7 for the baseline (lower is better).
Demonstrated the effectiveness of CLAP embeddings in improving sound quality.
Utilized transfer learning to adapt a large-scale pre-trained model to the Foley sound synthesis task.
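The FAD figures in the findings above are Fréchet distances between Gaussian fits to two sets of audio embeddings. The sketch below shows only that distance computation, assuming the embeddings have already been extracted as NumPy arrays; the function name `frechet_distance` and the array shapes are illustrative assumptions, and the embedding model and evaluation code used in the challenge are not reproduced here.

```python
import numpy as np


def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two embedding sets.

    Both inputs are assumed to have shape (num_clips, embedding_dim).
    This is an illustrative helper, not the challenge's evaluation code.
    """
    # Fit a Gaussian (mean and covariance) to each embedding set.
    mu_r = real_emb.mean(axis=0)
    mu_g = gen_emb.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_emb, rowvar=False))
    cov_g = np.atleast_2d(np.cov(gen_emb, rowvar=False))

    # Squared Euclidean distance between the two means.
    mean_term = float(np.sum((mu_r - mu_g) ** 2))

    # tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs; clip small negative numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    sqrt_trace = float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))

    trace_term = float(np.trace(cov_r) + np.trace(cov_g)) - 2.0 * sqrt_trace
    return mean_term + trace_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 4))              # stand-in for real-audio embeddings
    generated = reference + rng.normal(scale=0.5, size=(200, 4))
    print(frechet_distance(reference, generated))
```

A lower value means the generated embedding distribution sits closer to the reference distribution, which is why 4.765 is an improvement over 9.7.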
Abstract
Foley sound presents the background sound for multimedia content, and generating Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we propose a system for DCASE 2023 Challenge Task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, a diffusion-based text-to-audio generation model. To alleviate the data-hungry problem, the system is first trained on large-scale datasets and then adapted to this DCASE task via transfer learning. Through experiments, we found that the feature extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by supplementing the input label with related text embedding features obtained from a language-audio model, i.e., contrastive language-audio pretraining (CLAP). In addition, we utilize a filtering…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
