Exploring State-Space-Model based Language Model in Music Generation
Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang

TL;DR
This paper investigates Mamba-based state space models for text-to-music generation and reports faster convergence than, and quality comparable to, a Transformer decoder in limited-resource settings.
Contribution
It adapts SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling of single-codebook RVQ tokens in music generation, and compares it against a standard Transformer-based decoder.
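For readers unfamiliar with state space models, the following is a minimal sketch of a generic discrete linear SSM recurrence, h_t = A h_{t-1} + B x_t and y_t = C h_t, which is the building block that Mamba extends with input-dependent parameters. It is not SiMBA or the paper's decoder; the function name `ssm_scan` and the matrices `A`, `B`, `C` are illustrative assumptions. The sketch shows why such a recurrence is naturally causal, which is the property a decoder needs: each output depends only on current and past inputs.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a toy linear SSM over a 1-D input sequence x.

    A: (N, N) state matrix, B: (N,) input vector, C: (N,) output vector.
    These are fixed here; this is an assumption made for illustration only.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t      # state update uses only past state and current input
        ys.append(float(C @ h))  # scalar readout at this step
    return np.array(ys)

# Hypothetical parameters chosen by hand for the example.
A = np.diag([0.9, 0.5])
B = np.array([1.0, 1.0])
C = np.array([1.0, -0.5])

x = np.array([1.0, 0.0, 0.0, 0.0])   # unit impulse
y = ssm_scan(x, A, B, C)

# Perturb only the last input; earlier outputs must not change (causality).
x_changed = x.copy()
x_changed[-1] += 1.0
y_changed = ssm_scan(x_changed, A, B, C)
```

Comparing `y` and `y_changed` shows that only the final output differs, so the recurrence can be trained with next-token prediction without an explicit causal mask.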
Findings
SiMBA converges faster than Transformers in limited-resource scenarios.
A single RVQ codebook layer is empirically found to capture semantic information in music.
SiMBA generates outputs closer to ground truth under resource constraints.
Abstract
The recent surge in State Space Models (SSMs), particularly the emergence of Mamba, has established them as strong alternatives or complementary modules to Transformers across diverse domains. In this work, we aim to explore the potential of Mamba-based architectures for text-to-music generation. We adopt discrete tokens of Residual Vector Quantization (RVQ) as the modeling representation and empirically find that a single-layer codebook can capture semantic information in music. Motivated by this observation, we focus on modeling a single-codebook representation and adapt SiMBA, originally designed as a Mamba-based encoder, to function as a decoder for sequence modeling. We compare its performance against a standard Transformer-based decoder. Our results suggest that, under limited-resource settings, SiMBA achieves much faster convergence and generates outputs closer to the ground…
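To make the RVQ representation mentioned in the abstract concrete, here is a toy sketch of residual vector quantization: each codebook layer quantizes the residual left by the previous layers, so the first layer carries the coarsest information and later layers add refinements. The codebooks, the input vector, and the function names `rvq_encode` and `rvq_decode` are hand-made assumptions for illustration; this is not the audio codec used in the paper.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Return one code index per layer by quantizing successive residuals."""
    residual = np.asarray(x, dtype=float)
    indices = []
    for cb in codebooks:                              # cb has shape (num_codes, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)  # squared distance to each code
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]                 # pass the leftover to the next layer
    return indices

def rvq_decode(indices, codebooks, num_layers=None):
    """Sum the selected codes from the first num_layers layers (all by default)."""
    n = len(indices) if num_layers is None else num_layers
    return sum(codebooks[i][indices[i]] for i in range(n))

# Hypothetical two-layer codebooks: a coarse grid and a finer grid.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]]),
]
x = np.array([1.2, 0.3])

indices = rvq_encode(x, codebooks)
coarse = rvq_decode(indices, codebooks, num_layers=1)  # first codebook only
full = rvq_decode(indices, codebooks)                  # both codebooks
coarse_err = float(np.linalg.norm(x - coarse))
full_err = float(np.linalg.norm(x - full))
```

In this toy case the first-layer reconstruction already lands near the input and the second layer only refines it, which mirrors the intuition behind modeling a single-codebook token sequence.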
Taxonomy
Topics: Music Technology and Sound Studies · Speech Recognition and Synthesis · Music and Audio Processing
