Audio representations for deep learning in sound synthesis: A review
Anastasia Natsiou, Sean O'Leary

TL;DR
This review paper discusses various audio representations used in deep learning-based sound synthesis, highlighting how different representations influence model architecture choices, training efficiency, and sound quality evaluation.
Contribution
It provides a comprehensive overview of audio representations and their impact on deep learning sound synthesis architectures and evaluation methods.
Findings
Different audio representations affect model complexity and training time.
Transformations like feature extraction improve efficiency and perceptual relevance.
Evaluation metrics vary depending on the audio representation used.
Abstract
The rise of deep learning algorithms has led many researchers to move away from classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representation. A sound's original waveform can be too dense and rich for deep learning models to handle efficiently, and this complexity increases training time and computational cost. The waveform also does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio has been transformed into a compressed and more meaningful form using downsampling, feature extraction, or even by adopting a higher-level representation of the waveform. Furthermore,…
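As a rough illustration of the feature-extraction idea in the abstract, the sketch below turns one second of raw audio into a coarse time-frequency representation with fewer values than the waveform. It is a minimal sketch under stated assumptions, not a method taken from the paper: the function names `stft_magnitude` and `pool_bands`, the 16 kHz sample rate, the frame and hop sizes, and the uniform band pooling (a crude stand-in for a perceptual filterbank such as mel) are all chosen here for illustration.

```python
import numpy as np

def stft_magnitude(wave, n_fft=1024, hop=256):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1))

def pool_bands(spec, n_bands=64):
    """Average adjacent frequency bins into n_bands uniform bands.

    Assumption: uniform pooling is used only to keep the sketch short;
    a perceptually motivated scale would space the bands non-uniformly.
    """
    edges = np.linspace(0, spec.shape[1], n_bands + 1).astype(int)
    return np.stack(
        [spec[:, a:b].mean(axis=1) for a, b in zip(edges[:-1], edges[1:])],
        axis=1,
    )

sr = 16000                             # assumed sample rate in Hz
t = np.arange(sr) / sr                 # one second of time stamps
wave = np.sin(2 * np.pi * 440.0 * t)   # synthetic 440 Hz tone as test input

spec = stft_magnitude(wave)            # shape (59, 513)
bands = pool_bands(spec)               # shape (59, 64)
print(wave.shape, spec.shape, bands.shape)
```

With these settings the waveform holds 16000 samples, while the pooled representation holds 59 × 64 = 3776 values. The full spectrogram on its own is larger than the waveform, so the reduction here comes from the hop size and the band pooling, and it discards phase, which a synthesis model would then have to recover or predict.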
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
