MamTra: A Hybrid Mamba-Transformer Backbone for Speech Synthesis

Tan Dat Nguyen; Sangmin Bae; Joon Son Chung; Ji-Hoon Kim

arXiv:2603.12342 · eess.AS · March 16, 2026

Open Access

TL;DR

MamTra is a hybrid Mamba-Transformer model for speech synthesis that pairs Mamba's efficiency with the Transformer's global context modeling. It reduces inference VRAM by up to 34% and maintains speech fidelity with only 2% of the original training data.

Contribution

The paper introduces MamTra, a novel hybrid Mamba-Transformer architecture with knowledge transfer strategies, enabling efficient speech synthesis without sacrificing quality.

Findings

01 Reduces inference VRAM by up to 34%.

02 Maintains speech fidelity with only 2% of original training data.

03 Outperforms existing models in efficiency and quality.

Abstract

Despite the remarkable quality of LLM-based text-to-speech systems, their reliance on autoregressive Transformers leads to quadratic computational complexity, which severely limits practical applications. Linear-time alternatives, notably Mamba, offer a potential remedy; however, they often sacrifice the global context essential for expressive synthesis. In this paper, we propose MamTra, an interleaved Mamba-Transformer framework designed to leverage the advantages of Mamba's efficiency and Transformers' modeling capability. We also introduce novel knowledge transfer strategies to distill insights from a pretrained Transformer into our hybrid architecture, thereby bypassing the prohibitive costs of training from scratch. Systematic experiments identify the optimal hybrid configuration, and demonstrate that MamTra reduces inference VRAM usage by up to 34% without compromising speech…
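
The memory argument behind an interleaved design can be illustrated with a rough sketch. During autoregressive decoding, each attention layer keeps a key/value cache that grows with sequence length, whereas a Mamba layer carries a fixed-size state. The layer ratio, layer count, head sizes, and function names below are hypothetical choices for illustration only; they are not the configuration reported in the paper, and the resulting ratio is not meant to reproduce the 34% figure.

```python
# Illustrative sketch only: the interleaving ratio and model sizes are
# assumptions, not the MamTra configuration from the paper.

def interleaved_schedule(num_layers: int, attention_every: int) -> list[str]:
    """Build a layer list where every `attention_every`-th layer is attention
    and all remaining layers are Mamba blocks."""
    if num_layers <= 0 or attention_every <= 0:
        raise ValueError("num_layers and attention_every must be positive")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(num_layers)
    ]


def kv_cache_elements(schedule: list[str], seq_len: int,
                      num_heads: int, head_dim: int) -> int:
    """Count cached key and value elements across attention layers.

    Each attention layer stores keys and values (factor 2) for every
    position, head, and head dimension. Mamba layers are not counted
    because their recurrent state does not grow with seq_len.
    """
    num_attention = sum(1 for kind in schedule if kind == "attention")
    return 2 * num_attention * seq_len * num_heads * head_dim


# Hypothetical 8-layer models: one hybrid, one attention-only baseline.
hybrid = interleaved_schedule(num_layers=8, attention_every=4)
baseline = ["attention"] * 8

hybrid_cache = kv_cache_elements(hybrid, seq_len=1000, num_heads=8, head_dim=64)
baseline_cache = kv_cache_elements(baseline, seq_len=1000, num_heads=8, head_dim=64)

print(hybrid)
print(hybrid_cache / baseline_cache)
```

In this toy setting only two of the eight layers hold a cache, so the sequence-dependent memory is a quarter of the attention-only baseline. Real savings depend on weights, activations, and the fixed Mamba state, none of which this sketch models.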

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis