A Survey of Deep Learning Audio Generation Methods
Matej Božić, Marko Horvat

TL;DR
This survey reviews deep learning techniques for audio generation, covering audio representations, model architectures, and evaluation metrics. It aims to help beginners understand current state-of-the-art methods and future research directions.
Contribution
It provides a detailed overview of various deep learning architectures and audio representations used in audio generation, highlighting recent developments and evaluation methods.
Findings
Explains fundamental audio representations and recent developments.
Details deep learning architectures, including autoencoders, GANs, normalizing flows, Transformers, and diffusion models.
Summarizes evaluation metrics used in audio generation.
Abstract
This article presents a review of typical techniques used in three distinct aspects of deep learning model development for audio generation. In the first part of the article, we provide an explanation of audio representations, beginning with the fundamental audio waveform. We then progress to the frequency domain, with an emphasis on the attributes of human hearing, and finally introduce a relatively recent development. The main part of the article focuses on explaining basic and extended deep learning architecture variants, along with their practical applications in the field of audio generation. The following architectures are addressed: 1) Autoencoders 2) Generative adversarial networks 3) Normalizing flows 4) Transformer networks 5) Diffusion models. Lastly, we will examine four distinct evaluation metrics that are commonly employed in audio generation. This article aims to offer…
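As a small illustration of the first step the abstract describes (moving from the raw waveform to a frequency-domain view), the following is a minimal sketch and is not taken from the paper. It uses only the Python standard library; the function names, the 8000 Hz sample rate, the 256-sample frame, and the 500 Hz test tone are all assumptions chosen for illustration. A practical pipeline would normally use an FFT with windowing rather than this naive DFT.

```python
import cmath
import math


def sine_wave(freq_hz, sample_rate, n_samples):
    """Return n_samples of a unit-amplitude sine wave as a list of floats."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate)
        for n in range(n_samples)
    ]


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for the non-negative frequency bins."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        mags.append(abs(acc))
    return mags


def peak_frequency(frame, sample_rate):
    """Return the centre frequency (Hz) of the strongest DFT bin."""
    mags = magnitude_spectrum(frame)
    k = max(range(len(mags)), key=mags.__getitem__)
    return k * sample_rate / len(frame)


# Illustrative values (assumptions, not from the paper).
SAMPLE_RATE = 8000
waveform = sine_wave(500.0, SAMPLE_RATE, 256)

if __name__ == "__main__":
    print(peak_frequency(waveform, SAMPLE_RATE))
```

The test tone is chosen so that it falls exactly on a DFT bin (bin spacing is 8000 / 256 = 31.25 Hz, and 500 / 31.25 = 16), so the spectrum has a single dominant peak. Tones that fall between bins would spread energy across neighbouring bins, which is one reason real systems apply a window function first.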
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Normalizing Flows · Diffusion · Adam
