Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation

HaeJun Yoo; Hao-Wen Dong; Jongmin Jung; Dasaem Jeong

arXiv:2408.01180 · cs.SD · March 17, 2026

TL;DR

The paper introduces the Nested Music Transformer (NMT), an autoregressive model for symbolic music and audio generation that decodes the sub-tokens of each compound token one after another instead of predicting them all at once. This captures sub-token interdependencies while keeping the shorter sequence length that compound tokens provide.

Contribution

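It presents a nested architecture of two transformers: a main decoder that models the sequence of compound tokens and a sub-decoder that models the sub-tokens within each compound token. Compound tokens are decoded autoregressively, much like flattened tokens, but with low memory usage.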

Findings

01 Improved perplexity on symbolic music datasets
02 Enhanced modeling of sub-token interdependencies
03 Efficient decoding of compound tokens in audio and music

Abstract

Representing symbolic music with compound tokens, where each token consists of several different sub-tokens, each representing a distinct musical feature or attribute, offers the advantage of reducing sequence length. While previous research has validated the efficacy of compound tokens in music sequence modeling, predicting all sub-tokens simultaneously can lead to suboptimal results as it may not fully capture the interdependencies between them. We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage. The NMT consists of two transformers: the main decoder that models a sequence of compound tokens and the sub-decoder for modeling sub-tokens of each compound token. The experiment results showed that applying the NMT to compound tokens can enhance the performance…
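
The difference between flattened and compound sequences, and the idea of decoding sub-tokens one after another inside a compound token, can be illustrated with a small sketch. This is not the authors' implementation: the feature names in FEATURES, the helper names, and the toy_predict rule are all hypothetical, and a plain function stands in for the sub-decoder, which in the NMT is itself a transformer.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical sub-token features; the real set depends on the encoding scheme.
FEATURES = ("beat", "pitch", "duration", "velocity")

Compound = Dict[str, int]
# Stand-in for a sub-decoder:
# (main-decoder hidden state, feature name, earlier sub-tokens) -> sub-token id
SubPredictor = Callable[[int, str, Compound], int]


def flatten(notes: Sequence[Compound]) -> List[int]:
    """Flattened encoding: every sub-token takes its own sequence position."""
    return [note[f] for note in notes for f in FEATURES]


def decode_compound(hidden: int, predict_sub: SubPredictor) -> Compound:
    """Decode one compound token sub-token by sub-token.

    Each prediction sees the hidden state for this position plus the
    sub-tokens already decoded, rather than all being predicted at once.
    """
    decoded: Compound = {}
    for feature in FEATURES:
        decoded[feature] = predict_sub(hidden, feature, dict(decoded))
    return decoded


def toy_predict(hidden: int, feature: str, previous: Compound) -> int:
    """Toy rule (not a trained model): depends on how many sub-tokens came before."""
    return (hidden + len(previous)) % 128


notes = [
    {"beat": 0, "pitch": 60, "duration": 4, "velocity": 80},
    {"beat": 4, "pitch": 64, "duration": 4, "velocity": 72},
]
# 2 compound positions versus 8 flattened positions for the same notes
print(len(notes), len(flatten(notes)))
print(decode_compound(10, toy_predict))
```

With four sub-tokens per note, the main sequence is four times shorter than the flattened one, while the inner loop still conditions each sub-token on the ones decoded before it.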

Code & Models

Repositories

judejiwoo/nmt (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections