Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics
Philippe Esling, Axel Chemla--Romeu-Santos, Adrien Bitton

TL;DR
The paper regularizes Variational Auto-Encoders (VAEs) with perceptual ratings from timbre studies to build generative, invertible timbre spaces that can be used both to analyze novel musical instruments and to synthesize audio.
Contribution
It adapts VAEs with perceptual regularization to produce continuous, invertible timbre spaces aligned with human perception, enabling synthesis and analysis of new instruments.
Findings
The Non-Stationary Gabor Transform (NSGT) provides the best correlation with perceptual timbre spaces.
The model generalizes to novel instruments.
Audio descriptors evolve smoothly along the latent dimensions.
Abstract
Timbre spaces have been used in music perception to study the perceptual relationships between instruments based on dissimilarity ratings. However, these spaces do not generalize to novel examples and do not provide an invertible mapping, preventing audio synthesis. In parallel, generative models have aimed to provide methods for synthesizing novel timbres. However, these systems do not provide an understanding of their inner workings and are usually not related to any perceptually relevant information. Here, we show that Variational Auto-Encoders (VAE) can alleviate all of these limitations by constructing generative timbre spaces. To do so, we adapt VAEs to learn an audio latent space, while using perceptual ratings from timbre studies to regularize the organization of this space. The resulting space allows us to analyze novel instruments, while being able to synthesize audio from any…
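The regularization idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the squared-error reconstruction term, the weights `beta` and `alpha`, and in particular the form of the regularizer (a mean squared gap between normalized latent distances and normalized dissimilarity ratings) are all assumptions made for illustration. The sketch only shows how a perceptual term could be added to a standard VAE objective so that distances between latent codes follow the rated dissimilarities.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # KL(N(mu, diag(exp(logvar))) || N(0, I)): summed over latent dims, averaged over the batch.
    return np.mean(0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1))

def pairwise_distances(z):
    # Euclidean distance between every pair of latent codes, shape (n, n).
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def perceptual_regularizer(z, dissimilarity):
    # Hypothetical regularizer (an assumption, not taken from the paper):
    # compare latent distances with perceptual dissimilarity ratings after
    # rescaling both to [0, 1], using each unordered pair once.
    upper = np.triu_indices(len(z), k=1)
    d_latent = pairwise_distances(z)[upper]
    d_percept = dissimilarity[upper]
    d_latent = d_latent / (d_latent.max() + 1e-8)
    d_percept = d_percept / (d_percept.max() + 1e-8)
    return np.mean((d_latent - d_percept) ** 2)

def regularized_vae_loss(x, x_hat, mu, logvar, dissimilarity, beta=1.0, alpha=1.0):
    # Reconstruction + beta * KL + alpha * perceptual term.
    # beta and alpha are illustrative weights, not values from the paper.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    kl = kl_to_standard_normal(mu, logvar)
    reg = perceptual_regularizer(mu, dissimilarity)
    return recon + beta * kl + alpha * reg
```

Here `mu` and `logvar` stand for the encoder outputs for a batch with one example per instrument, and `dissimilarity` is a symmetric matrix of ratings over the same instruments. In a real training loop the same terms would be written in an autodiff framework so that the regularizer's gradient reaches the encoder.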
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies
