Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models
Ziyu Wang, Lejun Min, Gus Xia

TL;DR
This paper introduces a hierarchical diffusion model for whole-song symbolic music generation, capturing global structure and local details, enabling controllable and high-quality full-piece music synthesis.
Contribution
It presents the first hierarchical modeling approach for full-song generation using cascaded diffusion models conditioned on different levels of musical semantics.
Findings
Generated music exhibits recognizable verse-chorus structure.
Music quality surpasses baseline models.
Model allows flexible control over musical features.
Abstract
Recent deep music generation studies have put much emphasis on long-term generation with structures. However, we are yet to see high-quality, well-structured whole-song generation. In this paper, we make the first attempt to model a full music piece under the realization of compositional hierarchy. With a focus on symbolic representations of pop songs, we define a hierarchical language, in which each level of hierarchy focuses on the semantics and context dependency at a certain music scope. The high-level languages reveal whole-song form, phrase, and cadence, whereas the low-level languages focus on notes, chords, and their local patterns. A cascaded diffusion model is trained to model the hierarchical language, where each level is conditioned on its upper levels. Experiments and analysis show that our model is capable of generating full-piece music with recognizable global…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsMusic and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception
MethodsFocus · Diffusion
