MamTra: A Hybrid Mamba-Transformer Backbone for Speech Synthesis
Tan Dat Nguyen, Sangmin Bae, Joon Son Chung, Ji-Hoon Kim

TL;DR
MamTra is a hybrid Mamba-Transformer model for speech synthesis that pairs Mamba's linear-time efficiency with the Transformer's global context modeling. It reduces inference VRAM usage and needs far less training data, without compromising speech quality.
Contribution
The paper introduces MamTra, an interleaved Mamba-Transformer architecture, together with knowledge transfer strategies that distill a pretrained Transformer into the hybrid model. This enables efficient speech synthesis without the cost of training from scratch.
Findings
Reduces inference VRAM by up to 34%.
Maintains speech fidelity when trained on only 2% of the original training data.
Outperforms existing models in efficiency and quality.
Abstract
Despite the remarkable quality of LLM-based text-to-speech systems, their reliance on autoregressive Transformers leads to quadratic computational complexity, which severely limits practical applications. Linear-time alternatives, notably Mamba, offer a potential remedy; however, they often sacrifice the global context essential for expressive synthesis. In this paper, we propose MamTra, an interleaved Mamba-Transformer framework designed to leverage the advantages of Mamba's efficiency and Transformers' modeling capability. We also introduce novel knowledge transfer strategies to distill insights from a pretrained Transformer into our hybrid architecture, thereby bypassing the prohibitive costs of training from scratch. Systematic experiments identify the optimal hybrid configuration, and demonstrate that MamTra reduces inference VRAM usage by up to 34% without compromising speech…
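The interleaving idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names `interleaved_schedule` and `toy_mixing_cost`, the "one attention layer every k layers" pattern, and the cost model (sequence length squared for an attention layer, sequence length for a Mamba layer, constants ignored). The abstract does not state the paper's actual layer ratio or placement, so this is not MamTra's configuration.

```python
from typing import List


def interleaved_schedule(num_layers: int, attention_every: int) -> List[str]:
    """Build a hypothetical layer schedule in which every
    `attention_every`-th layer is attention and the rest are Mamba."""
    if num_layers <= 0 or attention_every <= 0:
        raise ValueError("num_layers and attention_every must be positive")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(num_layers)
    ]


def toy_mixing_cost(schedule: List[str], seq_len: int) -> int:
    """Toy token-mixing cost: seq_len**2 per attention layer,
    seq_len per Mamba layer. Constant factors are ignored."""
    return sum(
        seq_len ** 2 if kind == "attention" else seq_len
        for kind in schedule
    )


if __name__ == "__main__":
    # Assumed example: 12 layers, attention at every 4th position.
    hybrid = interleaved_schedule(num_layers=12, attention_every=4)
    pure_transformer = ["attention"] * 12
    for n in (256, 2048):
        ratio = toy_mixing_cost(hybrid, n) / toy_mixing_cost(pure_transformer, n)
        print(f"seq_len={n}: hybrid/pure cost ratio = {ratio:.3f}")
```

Under this toy model, the hybrid's cost approaches the fraction of layers that remain attention as sequences grow, which is the intuition behind keeping only a few attention layers for global context. Real savings depend on implementation details the sketch leaves out, such as key-value caching and kernel efficiency.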
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
