Is Disentanglement enough? On Latent Representations for Controllable Music Generation
Ashis Pati, Alexander Lerch

TL;DR
This paper investigates whether disentangled latent representations in VAEs are sufficient for controllable music generation. It highlights the importance of a strong generative decoder and of latent space structure, and proposes evaluation metrics.
Contribution
The study systematically analyzes the link between disentanglement and controllability in VAEs, emphasizing the decoder's role and introducing new evaluation methods.
Findings
High disentanglement does not guarantee controllability without a strong decoder.
Latent space structure significantly influences attribute manipulation.
Proposed metrics effectively evaluate controllability in latent spaces.
Abstract
Improving controllability or the ability to manipulate one or more attributes of the generated data has become a topic of interest in the context of deep generative models of music. Recent attempts in this direction have relied on learning disentangled representations from data such that the underlying factors of variation are well separated. In this paper, we focus on the relationship between disentanglement and controllability by conducting a systematic study using different supervised disentanglement learning algorithms based on the Variational Auto-Encoder (VAE) architecture. Our experiments show that a high degree of disentanglement can be achieved by using different forms of supervision to train a strong discriminative encoder. However, in the absence of a strong generative decoder, disentanglement does not necessarily imply controllability. The structure of the latent space with…
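The distinction drawn in the abstract, between an encoder that separates an attribute and a decoder that actually responds to it, can be illustrated with a toy sketch. This is not the paper's model or its metrics: the attribute (the mean of a vector), the linear `encode`, `strong_decode`, and `weak_decode` functions, and both scores are hypothetical stand-ins chosen only to make the idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean pattern, so scaling it never changes the mean of the output.
PATTERN = np.array([1.0, -1.0, 1.0, -1.0])


def attribute(x):
    """Hypothetical attribute of a data vector: its mean."""
    return x.mean(axis=-1)


def encode(x):
    """Toy encoder: latent dim 0 is the attribute, dim 1 is a residual feature."""
    a = attribute(x)
    return np.stack([a, x[:, 0] - a], axis=1)


def strong_decode(z):
    """Toy decoder that uses dim 0: the output mean equals z[:, 0]."""
    return z[:, :1] + z[:, 1:2] * PATTERN


def weak_decode(z):
    """Toy decoder that ignores dim 0 entirely."""
    return z[:, 1:2] * PATTERN


def encoder_side_score(x):
    """Disentanglement proxy: |correlation| between latent dim 0 and the input attribute."""
    z = encode(x)
    return abs(np.corrcoef(z[:, 0], attribute(x))[0, 1])


def decoder_side_slope(decode, n_steps=11):
    """Controllability proxy: slope of the output attribute while traversing dim 0."""
    z0 = np.linspace(-2.0, 2.0, n_steps)
    z = np.stack([z0, np.full(n_steps, 0.5)], axis=1)
    attrs = attribute(decode(z))
    return np.polyfit(z0, attrs, 1)[0]


x = rng.normal(size=(200, 4))
print("encoder-side score:", encoder_side_score(x))
print("slope, strong decoder:", decoder_side_slope(strong_decode))
print("slope, weak decoder:", decoder_side_slope(weak_decode))
```

In this sketch the encoder-side score is the same for both decoders, because it only looks at the encoder. Only the traversal slope separates the decoder that follows the latent dimension from the one that ignores it, which is the sense in which disentanglement alone need not imply controllability.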
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis
