Self-Supervised Disentanglement of Harmonic and Rhythmic Features in Music Audio Signals
Yiming Wu

TL;DR
This paper introduces a self-supervised deep learning approach using a variational autoencoder to disentangle rhythmic and harmonic features in music audio, enabling controllable music generation and remixing.
Contribution
It presents a novel self-supervised method for disentangling rhythmic and harmonic features in music audio using vector rotation in latent space within a variational autoencoder.
Findings
Effective disentanglement of rhythmic and harmonic features demonstrated
Improved controllable music remixing capabilities shown
Quantitative evaluation confirms feature separation
Abstract
The aim of latent variable disentanglement is to infer the multiple informative latent representations that lie behind a data generation process and is a key factor in controllable data generation. In this paper, we propose a deep neural network-based self-supervised learning method to infer the disentangled rhythmic and harmonic representations behind music audio generation. We train a variational autoencoder that generates an audio mel-spectrogram from two latent features representing the rhythmic and harmonic content. In the training phase, the variational autoencoder is trained to reconstruct the input mel-spectrogram given its pitch-shifted version. At each forward computation in the training phase, a vector rotation operation is applied to one of the latent features, assuming that the dimensions of the feature vectors are related to pitch intervals. Therefore, in the trained…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
