Training-Efficient Text-to-Music Generation with State-Space Modeling

Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, and Yi-Hsuan Yang

arXiv:2601.14786·cs.SD·January 22, 2026

TL;DR

This paper introduces a training-efficient, open-source text-to-music generation model using state-space models, achieving competitive quality with significantly less data and computation than existing Transformer-based models.

Contribution

The paper demonstrates that state-space models (SSMs) can replace Transformers as the backbone for text-to-music generation (TTM), training more efficiently while reaching comparable performance with less data and compute. It also provides an open-source implementation.

Findings

01. SSMs outperform Transformers in training efficiency.

02. The SSM-based models achieve competitive performance with only 9% of the FLOPs and 2% of the training data.

03. Performance is maintained at smaller model sizes under the same training budget.

Abstract

Recent advances in text-to-music generation (TTM) have yielded high-quality results, but often at the cost of extensive compute and the use of large proprietary internal data. To improve the affordability and openness of TTM training, an open-source generative model backbone that is more training- and data-efficient is needed. In this paper, we constrain the number of trainable parameters in the generative model to match that of the MusicGen-small benchmark (with about 300M parameters), and replace its Transformer backbone with the emerging class of state-space models (SSMs). Specifically, we explore different SSM variants for sequence modeling, and compare a single-stage SSM-based design with a decomposable two-stage SSM/diffusion hybrid design. All proposed models are trained from scratch on a purely public dataset comprising 457 hours of CC-licensed music, ensuring full openness. Our…
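As a rough illustration of what "state-space model" means for sequence modeling, the sketch below runs a generic discrete linear state-space recurrence, x[t] = A x[t-1] + B u[t], y[t] = C x[t], over a short input. This is not the paper's architecture: the function name ssm_scan, the matrices A, B, C, and the toy input are all assumptions chosen for illustration, and practical SSM layers add learned, often input-dependent parameters and parallel scan implementations.

```python
import numpy as np


def ssm_scan(A, B, C, u):
    """Apply x[t] = A @ x[t-1] + B * u[t], y[t] = C @ x[t] to a 1-D input u.

    A: (N, N) state transition, B: (N,) input map, C: (N,) output map.
    All names are hypothetical; this is a generic linear SSM, not the
    specific variants studied in the paper.
    """
    state = np.zeros(A.shape[0])
    outputs = []
    for u_t in u:
        # One recurrent step: fixed-size state, so cost per step is constant
        # in sequence length (unlike attention over all previous tokens).
        state = A @ state + B * u_t
        outputs.append(C @ state)
    return np.array(outputs)


# Toy example: a 2-dimensional state driven by a unit impulse.
A = np.diag([0.5, 0.9])          # two decaying modes (assumed values)
B = np.array([1.0, 1.0])
C = np.array([1.0, -1.0])
u = np.array([1.0, 0.0, 0.0, 0.0])

y = ssm_scan(A, B, C, u)
print(y)
```

The point of the sketch is the fixed-size recurrent state: each step touches only the current state and input, which is the general property that makes SSMs attractive as a cheaper alternative to attention for long sequences.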

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis