Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7

Yi Yuan; Haohe Liu; Xubo Liu; Xiyuan Kang; Mark D. Plumbley; Wenwu Wang

arXiv:2305.15905·cs.SD·September 18, 2023·2 cites

Open Access

TL;DR

This paper introduces a diffusion-based Foley sound generation system for the DCASE 2023 challenge, leveraging transfer learning, language-audio embeddings, and filtering strategies to improve sound synthesis quality.

Contribution

The system combines AudioLDM with CLAP embeddings and a filtering step to improve Foley sound generation, reducing the average FAD from the baseline's 9.7 to 4.765.

Findings

01

Achieved an average Fréchet Audio Distance (FAD) of 4.765 against the baseline's 9.7 (lower is better).

02

Demonstrated the effectiveness of CLAP embeddings in improving sound quality.

03

Utilized transfer learning to adapt a large-scale pre-trained model to the Foley sound synthesis task.
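
The FAD figures in the first finding can be read with the usual definition of the metric in mind: a Gaussian is fitted to embeddings of reference audio and another to embeddings of generated audio, and the Fréchet distance between the two is reported. The sketch below is a minimal illustration of that distance only, not the challenge's evaluation code. The function name `frechet_distance` and the random arrays are assumptions made for the example; a real evaluation would replace the arrays with embeddings from an audio model.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input is an array of shape (num_clips, embedding_dim).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in
    # exact arithmetic; clip small negative values caused by rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Placeholder data standing in for real audio embeddings (hypothetical).
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 4))
generated = reference + 0.5  # same spread, shifted mean
score = frechet_distance(reference, generated)
```

Identical embedding sets give a distance of zero, and the distance grows as the means or covariances drift apart, which is why a drop from 9.7 to 4.765 counts as an improvement.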

Abstract

Foley sound provides the background sound for multimedia content, and generating Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we propose a system for DCASE 2023 Challenge Task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, a diffusion-based text-to-audio generation model. To alleviate the data-hungry problem, the system is first trained on large-scale datasets and then adapted to this DCASE task via transfer learning. Through experiments, we found that the feature extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by supplementing the input label with related text embedding features obtained from a language-audio model, i.e., contrastive language-audio pretraining (CLAP). In addition, we utilize a filtering…
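
The abstract is cut off before it says how the filtering works, so the following is not the authors' procedure. It is a minimal sketch of one plausible form of embedding-based filtering, assuming each generated clip and the target label have already been mapped to vectors in a shared space (as a CLAP-style model would do). The names `cosine_similarity` and `select_top_k`, the clip identifiers, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_top_k(candidates, label_embedding, k):
    """Keep the k candidates whose embeddings best match the label.

    candidates: list of (clip_id, embedding) pairs.
    Returns the clip ids ordered from most to least similar.
    """
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(item[1], label_embedding),
        reverse=True,
    )
    return [clip_id for clip_id, _ in ranked[:k]]


# Toy 2-D vectors standing in for real clip and label embeddings.
candidates = [
    ("clip_a", [1.0, 0.0]),
    ("clip_b", [0.0, 1.0]),
    ("clip_c", [1.0, 1.0]),
]
label_embedding = [1.0, 0.1]
kept = select_top_k(candidates, label_embedding, k=2)
```

Generating more candidates than needed and keeping only the best-matching ones trades extra sampling time for output that stays closer to the requested sound class.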

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing