The Power of Fragmentation: A Hierarchical Transformer Model for Structural Segmentation in Symbolic Music Generation
Guowei Wu, Shipei Liu, Xiaoya Fan

TL;DR
This paper introduces a hierarchical Transformer model for symbolic music generation that captures multi-scale structural elements like sections and chords, leading to more realistic and stylistically consistent music.
Contribution
It proposes a novel hierarchical Transformer with a Fragment Scope Localization layer and multi-scale attention for better structural understanding in music generation.
Findings
Outperforms current state-of-the-art models in quantitative metrics.
Produces music that is more realistic and reuses melodic material more often, according to visual evaluation.
Achieves consistent style across generated sections with Music Style Normalization.
Abstract
Symbolic Music Generation relies on the contextual representation capabilities of the generative model, where the most prevalent approach is the Transformer-based model. The learning of musical context is also related to the structural elements in music, e.g. intro, verse, and chorus, which are currently overlooked by the research community. In this paper, we propose a hierarchical Transformer model to learn multi-scale contexts in music. In the encoding phase, we first design a Fragment Scope Localization layer to segment the music into chords and sections. Then, we use a multi-scale attention mechanism to learn note-, chord-, and section-level contexts. In the decoding phase, we propose a hierarchical Transformer model that uses fine-decoders to generate sections in parallel and a coarse-decoder to decode the combined music. We also design a Music Style Normalization layer to…
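To make the idea of note-, chord-, and section-level contexts concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the Fragment Scope Localization layer finds boundaries or how the multi-scale attention is parameterized, so the boundaries are supplied by hand, coarser scales are built by simple mean pooling, and a single unparameterized attention head is used. All function names (`segment_mean`, `attend`, `multi_scale_context`) are hypothetical.

```python
import numpy as np

def segment_mean(x, boundaries):
    """Mean-pool rows of x over half-open segments [start, end)."""
    return np.stack([x[s:e].mean(axis=0) for s, e in boundaries])

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys):
    """Single-head scaled dot-product attention; keys double as values."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ keys, weights

def multi_scale_context(notes, chord_bounds, section_bounds):
    """Let every note attend over note-, chord-, and section-level vectors.

    Assumption: chord and section vectors are mean-pooled note embeddings.
    The paper's actual pooling and attention design may differ.
    """
    chords = segment_mean(notes, chord_bounds)
    sections = segment_mean(notes, section_bounds)
    memory = np.concatenate([notes, chords, sections], axis=0)
    return attend(notes, memory)

# Toy input: 8 notes with 4-dimensional embeddings (random placeholders).
rng = np.random.default_rng(0)
notes = rng.normal(size=(8, 4))
chord_bounds = [(0, 2), (2, 4), (4, 6), (6, 8)]   # hand-picked chord spans
section_bounds = [(0, 4), (4, 8)]                 # hand-picked section spans

context, weights = multi_scale_context(notes, chord_bounds, section_bounds)
```

Each row of `weights` is a distribution over 8 notes, 4 chords, and 2 sections, so a note's context vector mixes information from all three scales. A trained model would use learned projections and learned boundaries in place of the fixed ones assumed here.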
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Adam · VERtex Similarity Embeddings · Byte Pair Encoding · Residual Connection · Label Smoothing · Absolute Position Encodings
