Long-form music generation with latent diffusion

Zach Evans; Julian D. Parker; CJ Carr; Zack Zukowski; Josiah Taylor; Jordi Pons

arXiv:2404.10301 · cs.SD · July 30, 2024 · 5 citations

Open Access 1 Repo

TL;DR

This paper introduces a latent diffusion model capable of generating long-form, coherent music tracks up to nearly five minutes, achieving state-of-the-art audio quality and prompt alignment through training on long temporal contexts.

Contribution

It presents a novel diffusion-transformer model operating on a downsampled latent space for long-form music generation from text prompts.

Findings

01 Generates coherent music tracks of up to 4 minutes 45 seconds.

02 Outperforms previous models on audio quality and prompt alignment metrics.

03 Subjective listening tests confirm that the full-length generations have coherent musical structure.

Abstract

Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure from text prompts. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
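The two figures in the abstract (a 21.5 Hz latent rate and a 4m45s maximum duration) imply a rough sequence length for the diffusion-transformer. The sketch below only works through that arithmetic. The 44.1 kHz sample rate is an assumption that the abstract does not state, and the function names are hypothetical, so the results are back-of-the-envelope estimates and not the paper's exact configuration.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the abstract.
LATENT_RATE_HZ = 21.5            # latent rate, from the abstract
MAX_DURATION_S = 4 * 60 + 45     # 4m45s expressed in seconds, from the abstract
ASSUMED_SAMPLE_RATE_HZ = 44_100  # ASSUMPTION: not stated in the abstract


def latent_frames(duration_s: float, latent_rate_hz: float = LATENT_RATE_HZ) -> float:
    """Approximate number of latent frames needed to cover duration_s seconds."""
    return duration_s * latent_rate_hz


def downsampling_factor(sample_rate_hz: float, latent_rate_hz: float = LATENT_RATE_HZ) -> float:
    """Approximate number of audio samples represented by one latent frame."""
    return sample_rate_hz / latent_rate_hz


frames = latent_frames(MAX_DURATION_S)
factor = downsampling_factor(ASSUMED_SAMPLE_RATE_HZ)
audio_samples = MAX_DURATION_S * ASSUMED_SAMPLE_RATE_HZ

print(f"duration: {MAX_DURATION_S} s")
print(f"latent frames at {LATENT_RATE_HZ} Hz: about {frames:.0f}")
print(f"audio samples per channel (assumed rate): {audio_samples}")
print(f"implied downsampling factor (assumed rate): about {factor:.0f}")
```

Under these assumptions a full-length track is a sequence of roughly six thousand latent frames, compared with more than twelve million raw audio samples per channel. That gap illustrates why a highly downsampled latent makes training on long temporal contexts practical.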

Code & Models

Repositories

stability-ai/stable-audio-tools (PyTorch, official)

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing