Efficient Long-Sequence Diffusion Modeling for Symbolic Music Generation

Jinhan Xu; Xing Tang; Houpeng Yang; Haoran Zhang; Shenghua Yuan; Jiatao Chen; Tianming Xi; Jing Wang; Jiaojiao Yu; Guangli Xiang

arXiv:2603.00576 · cs.SD · March 3, 2026

Open Access

TL;DR

This paper introduces SMDIM, an efficient diffusion model for symbolic music generation that captures long-range dependencies with near-linear cost and refines local details, outperforming existing methods in quality and efficiency.

Contribution

The paper proposes SMDIM, a novel diffusion strategy combining structured state space models and hybrid refinement for scalable long-sequence symbolic music generation.

Findings

1. SMDIM outperforms state-of-the-art models in quality and efficiency.

2. The model generalizes well across diverse musical styles.

3. It effectively captures long-range dependencies with near-linear computational cost.

Abstract

Symbolic music generation is a challenging task in multimedia generation: it involves long sequences with hierarchical temporal structure, long-range dependencies, and fine-grained local details. Although recent diffusion-based models produce high-quality generations, they incur high training and inference costs on long symbolic sequences because of iterative denoising and costs that grow with sequence length. To address this problem, we propose SMDIM, a diffusion strategy that combines efficient global structure construction with lightweight local refinement. SMDIM uses structured state space models to capture long-range musical context at near-linear cost, and selectively refines local musical details via a hybrid refinement scheme. Experiments performed on a wide range of symbolic music datasets, encompassing Western classical music, popular music, and traditional folk…
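
To make the near-linear cost of state space models concrete, here is a minimal sketch of a scalar linear state space recurrence. It is not SMDIM's architecture, which the abstract does not specify. The function name `ssm_scan` and the parameters `a`, `b`, and `c` are hypothetical, chosen only for illustration.

```python
from typing import List


def ssm_scan(u: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a toy scalar state space recurrence over an input sequence.

    x[t] = a * x[t-1] + b * u[t]   (state update)
    y[t] = c * x[t]                (readout)

    Illustrative assumption only: real structured state space layers use
    vector-valued states and learned parameters.
    """
    x = 0.0  # hidden state, carried across the whole sequence
    y = []
    for u_t in u:  # one constant-cost update per time step
        x = a * x + b * u_t
        y.append(c * x)
    return y


# Impulse response: a single input at t=0 still influences later outputs,
# decaying geometrically with factor a.
impulse = [1.0, 0.0, 0.0, 0.0]
print(ssm_scan(impulse, a=0.5, b=1.0, c=1.0))  # [1.0, 0.5, 0.25, 0.125]
```

Each step does a constant amount of work, so a scan over a sequence of length L takes O(L) time, whereas full self-attention compares every pair of positions and scales quadratically in L. The single carried state is also how an early input can affect outputs arbitrarily far downstream.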

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis