Evaluating Latent Space Structure in Timbre VAEs: A Comparative Study of Unsupervised, Descriptor-Conditioned, and Perceptual Feature-Conditioned Models
Joseph Cameron, Alan Blackwell

TL;DR
This study compares three types of VAEs for musical timbre generation, showing that conditioning on perceptual features creates more interpretable and pitch-invariant latent spaces than unsupervised or descriptor-conditioned models.
Contribution
The paper introduces a comprehensive evaluation framework for timbre VAE latent spaces and demonstrates the advantages of perceptual feature conditioning over unsupervised and descriptor-conditioned models.
Findings
Perceptual feature conditioning improves latent space compactness.
Perceptual feature-conditioned models show better pitch invariance.
Descriptor conditioning has limitations in interpretability.
Abstract
We present a comparative evaluation of latent space organization in three Variational Autoencoders (VAEs) for musical timbre generation: an unsupervised VAE, a descriptor-conditioned VAE, and a VAE conditioned on continuous perceptual features from the AudioCommons timbral models. Using a curated dataset of electric guitar sounds labeled with 19 semantic descriptors across four intensity levels, we assess each model's latent structure with a suite of clustering and interpretability metrics. These include silhouette scores, timbre descriptor compactness, pitch-conditional separation, trajectory linearity, and cross-pitch consistency. Our findings show that conditioning on perceptual features yields a more compact, discriminative, and pitch-invariant latent space, outperforming both the unsupervised and discrete descriptor-conditioned models. This work highlights the limitations of…
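As a rough illustration of the first metric named in the abstract, the sketch below computes a mean silhouette score over latent vectors grouped by descriptor label. It is a minimal standard-library version of the standard silhouette definition, not the authors' evaluation code; the function name, the 2-D latent codes, and the labels "bright" and "warm" are hypothetical placeholders chosen for the example.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by their label (here: a timbre descriptor).
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D latent codes and descriptor labels (illustration only).
latents = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
descriptors = ["bright", "bright", "warm", "warm"]
score = silhouette_score(latents, descriptors)
print(f"silhouette = {score:.3f}")
```

Scores near 1 indicate that points sharing a descriptor sit close together and far from other descriptors; scores near 0 or below indicate overlapping or mixed clusters.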
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Neuroscience and Music Perception
