TL;DR
LatentFT introduces a frequency-domain control method for generative music models, enabling intuitive manipulation of musical structures at different timescales in latent space.
Contribution
LatentFT combines a diffusion autoencoder with a latent Fourier transform so that musical patterns can be manipulated coherently by timescale, improving interpretability and control.
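The core operation can be illustrated with a minimal sketch. It assumes the latent is a real-valued array of shape (T, D), with T time frames and D channels, and that the latent Fourier transform is a real FFT along the time axis. The function name `latent_bandpass` and the (T, D) layout are hypothetical illustrations, not the authors' code. Keeping low-frequency bins retains slowly varying structure; keeping high-frequency bins retains fast-changing detail.

```python
import numpy as np

def latent_bandpass(z, keep_lo, keep_hi):
    """Hypothetical helper: keep only latent-frequency bins keep_lo..keep_hi (inclusive).

    z: real array of shape (T, D), an assumed layout of T latent frames x D channels.
    The FFT runs along the time axis (axis 0), so each bin index corresponds
    to how often a pattern repeats across the latent sequence.
    """
    T = z.shape[0]
    Z = np.fft.rfft(z, axis=0)                 # shape (T // 2 + 1, D)
    bins = np.arange(Z.shape[0])
    mask = (bins >= keep_lo) & (bins <= keep_hi)
    Z_band = Z * mask[:, None]                 # zero every bin outside the band
    return np.fft.irfft(Z_band, n=T, axis=0)   # back to a (T, D) latent sequence
```

Because the filter is linear, complementary bands sum back to the original latent. Assuming a hypothetical latent frame rate of r frames per second, bin k corresponds to k·r/T cycles per second, i.e. a pattern that repeats roughly every T/(k·r) seconds.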
Findings
LatentFT improves condition adherence and quality over baselines.
It enables musical variation and blending by manipulating latent frequencies.
Different musical attributes are localized in distinct latent spectrum regions.
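Blending by latent frequency can be sketched as taking some spectrum bins from one reference latent and the rest from another. This is an illustrative assumption about the general idea, not the paper's procedure. `latent_blend`, the (T, D) latent layout, and the single hard cutoff are all hypothetical. In LatentFT the manipulated latent would then condition the diffusion decoder, a step that is omitted here.

```python
import numpy as np

def latent_blend(z_a, z_b, cutoff_bin):
    """Hypothetical helper: slow patterns from z_a, fast patterns from z_b.

    Bins with index < cutoff_bin come from z_a; bins >= cutoff_bin come from z_b.
    Both latents are assumed to be real arrays of shape (T, D).
    """
    if z_a.shape != z_b.shape:
        raise ValueError("latents must have the same shape")
    T = z_a.shape[0]
    Za = np.fft.rfft(z_a, axis=0)
    Zb = np.fft.rfft(z_b, axis=0)
    low = (np.arange(Za.shape[0]) < cutoff_bin)[:, None]  # (bins, 1), broadcasts over D
    Z_mix = np.where(low, Za, Zb)                         # pick each bin from one source
    return np.fft.irfft(Z_mix, n=T, axis=0)
```

With `cutoff_bin=0` the result is just `z_b`, and with a cutoff past the last bin it is just `z_a`; intermediate cutoffs trade off which timescales each reference contributes.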
Abstract
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking latents in the frequency domain during training, our method yields representations that can be manipulated coherently at inference. This allows us to generate musical variations and blends from reference examples while preserving characteristics at desired timescales, which are specified as frequencies in the latent space. LatentFT parallels the role of the equalizer in music production: while traditional equalizers operate on audible frequencies to shape timbre, LatentFT operates on latent-space frequencies to shape musical structure. Experiments and listening tests show that LatentFT improves condition adherence…
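Training-time frequency masking can be sketched as sampling a random band of latent-frequency bins per example and zeroing the rest before the latent is passed on as conditioning. The exact mask distribution LatentFT uses is not given in this summary; the single contiguous band and the names `random_band_mask` and `mask_latents` are assumptions for illustration.

```python
import numpy as np

def random_band_mask(n_bins, rng, min_width=1):
    """Hypothetical: boolean mask keeping one random contiguous band of bins."""
    width = rng.integers(min_width, n_bins + 1)      # band width in [min_width, n_bins]
    start = rng.integers(0, n_bins - width + 1)      # start so the band fits
    mask = np.zeros(n_bins, dtype=bool)
    mask[start:start + width] = True
    return mask

def mask_latents(z, rng):
    """Hypothetical: apply a random latent-frequency band mask to z of shape (T, D)."""
    T = z.shape[0]
    Z = np.fft.rfft(z, axis=0)
    mask = random_band_mask(Z.shape[0], rng)
    z_masked = np.fft.irfft(Z * mask[:, None], n=T, axis=0)
    return z_masked, mask
```

The abstract attributes coherent inference-time manipulation to this kind of training: a decoder that has learned to reconstruct from many partial spectra can then be driven by whichever band a user keeps, much as an equalizer boosts or cuts chosen frequency ranges.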