A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI
Chenshuang Zhang, Chaoning Zhang, Sheng Zheng, Mengchun Zhang, Maryam Qamar, Sung-Ho Bae, and In So Kweon

TL;DR
This survey reviews recent advances in audio diffusion models for text-to-speech synthesis and speech enhancement, highlighting their categorization, methodologies, and experimental results in the context of generative AI.
Contribution
It provides a comprehensive overview of recent diffusion-based speech synthesis and enhancement methods, filling gaps left by previous surveys.
Findings
Diffusion models are effective in speech synthesis and enhancement.
Text-to-speech methods are categorized by the stage where the diffusion model is applied: acoustic model, vocoder, or end-to-end framework.
Experimental results demonstrate the superiority of diffusion-based approaches.
Abstract
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text-to-speech and speech enhancement. This work conducts a survey on audio diffusion models, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying diffusion models in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion models. As for the text-to-speech task, we divide the methods into three categories based on the stage where the diffusion model is adopted: acoustic model, vocoder, and end-to-end framework. Moreover, we categorize various speech enhancement tasks by either certain signals are removed…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Diffusion
