PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation
Longshen Ou, Ye Wang

TL;DR
This paper introduces PhraseVAE and PhraseLDM, a latent diffusion framework for full-song multitrack symbolic music generation. By modeling music at the phrase level rather than as note-attribute tokens, it sidesteps sequence-length limits and efficiently generates coherent, diverse, and structured pieces.
Contribution
The paper presents the first latent diffusion model for full-song symbolic music. A compact phrase-level latent space provides high-fidelity reconstruction and efficient, non-autoregressive generation.
Findings
Supports up to 128 bars of music in a single pass
Generates complete, high-quality songs within seconds
Maintains musical diversity and structural coherence
Abstract
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses an arbitrary variable-length polyphonic note sequence into a single compact 64-dimensional phrase-level latent representation with high reconstruction fidelity, allowing a well-structured latent space and efficient generative modeling. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64…
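To make the size of the phrase-level representation concrete, here is a minimal sketch of the bookkeeping implied by the abstract: one 64-dimensional latent per phrase, arranged over tracks and time. Only the latent size (64) and the 128-bar limit come from the abstract. The `SongLayout` class, the track count, and the `bars_per_phrase` parameter are hypothetical names and values chosen for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Tuple

LATENT_DIM = 64  # from the abstract: one 64-dimensional latent per phrase


@dataclass(frozen=True)
class SongLayout:
    """Hypothetical description of a song as a grid of phrase latents."""

    n_bars: int           # song length in bars (the abstract cites up to 128)
    n_tracks: int         # assumption: number of instrument tracks
    bars_per_phrase: int  # assumption: bars covered by one phrase latent

    def latent_shape(self) -> Tuple[int, int, int]:
        # Ceiling division so a trailing partial phrase still gets a latent.
        n_phrases = -(-self.n_bars // self.bars_per_phrase)
        return (self.n_tracks, n_phrases, LATENT_DIM)

    def n_latents(self) -> int:
        # Number of latent vectors a diffusion model would denoise jointly.
        tracks, phrases, _ = self.latent_shape()
        return tracks * phrases


# Illustrative values only: 4 tracks and 1 bar per phrase are assumptions.
layout = SongLayout(n_bars=128, n_tracks=4, bars_per_phrase=1)
print(layout.latent_shape())  # (4, 128, 64)
print(layout.n_latents())     # 512
```

Under these assumed values, the whole song is a grid of 512 vectors, which is far shorter than a note-attribute token sequence for the same piece. That gap is what would make a single non-autoregressive pass over the full song plausible.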
Taxonomy
Topics: Music Technology and Sound Studies · Artificial Intelligence in Games · Generative Adversarial Networks and Image Synthesis
