PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation

Longshen Ou; Ye Wang

arXiv:2512.11348 · cs.SD · December 17, 2025

TL;DR

This paper introduces PhraseVAE and PhraseLDM, a latent diffusion framework for full-song multitrack symbolic music generation. It replaces long note-attribute token sequences with compact phrase-level latents and generates an entire multitrack song in a single non-autoregressive pass.

Contribution

The paper presents what the authors describe as the first latent diffusion framework for full-song multitrack symbolic music, built on a compact 64-dimensional phrase-level latent space that supports high-fidelity reconstruction and efficient, non-autoregressive generation.

Findings

01. Supports up to 128 bars of music in a single pass

02. Generates complete songs within seconds with high quality

03. Maintains musical diversity and structural coherence

Abstract

This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses an arbitrary variable-length polyphonic note sequence into a single compact 64-dimensional phrase-level latent representation with high reconstruction fidelity, allowing a well-structured latent space and efficient generative modeling. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64…
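The data flow the abstract describes (variable-length note sequences in, one fixed 64-dimensional vector per phrase out, and a whole song represented as a grid of such vectors) can be illustrated with a toy sketch. This is not the authors' model: `encode_phrase` is a hypothetical hash-based stand-in for the learned PhraseVAE encoder, the `(pitch, onset, duration)` note format is assumed, and treating one bar per track as one phrase is an assumption the abstract does not confirm. Only the 64-dimensional latent size and the 128-bar limit come from the source.

```python
LATENT_DIM = 64   # from the abstract: one 64-dimensional latent per phrase
MAX_BARS = 128    # from the abstract: up to 128 bars per song


def encode_phrase(notes, dim=LATENT_DIM):
    """Hypothetical stand-in for a learned phrase encoder.

    Maps any number of (pitch, onset, duration) integer tuples to a
    fixed-length vector by counting notes into hashed buckets. The real
    PhraseVAE encoder is a trained network; this only mimics its
    input/output shapes.
    """
    vec = [0.0] * dim
    for pitch, onset, duration in notes:
        bucket = (pitch * 31 + onset * 7 + duration) % dim
        vec[bucket] += 1.0
    return vec


def song_latent_shape(n_tracks, n_phrases, dim=LATENT_DIM):
    """Shape of the latent grid a song-level diffusion model would denoise.

    Assumes (not stated in the abstract) one latent per track per bar.
    """
    return (n_tracks, n_phrases, dim)


# Phrases of different lengths all map to vectors of the same size.
empty_phrase = []
triad = [(60, 0, 4), (64, 0, 4), (67, 0, 4)]
latents = [encode_phrase(p) for p in (empty_phrase, triad)]
print([len(z) for z in latents])

# A hypothetical 4-track, 128-bar song becomes a small fixed-size tensor.
print(song_latent_shape(n_tracks=4, n_phrases=MAX_BARS))
```

The point of the sketch is the shape contract: because every phrase collapses to the same small vector regardless of note count, the generative model works on a fixed-size grid rather than on a note-token sequence whose length grows with the music.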

Taxonomy

Topics: Music Technology and Sound Studies · Artificial Intelligence in Games · Generative Adversarial Networks and Image Synthesis