TL;DR
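MambaVoiceCloning (MVC) makes the conditioning path of a diffusion-based TTS system fully SSM-only at inference by removing attention and explicit RNN-style recurrence layers from the text, rhythm, and prosody modules, which improves efficiency and quality, bounds activation memory, and enables finite look-ahead streaming.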
Contribution
MVC presents an SSM-only conditioning approach for diffusion TTS that removes the attention-based duration and style modules while keeping the StyleTTS2 mel-diffusion-vocoder backbone fixed, and it reports improved performance and deployability.
Findings
Achieves statistically reliable gains over existing models in MOS/CMOS, F0 RMSE, MCD, and WER.
Reduces encoder parameters to 21 million, increasing throughput by 1.6 times.
Improves memory footprint, stability, and deployability of diffusion TTS systems.
Abstract
MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an Expressive Mamba with AdaLN modulation, yielding linear-time O(T) conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but…
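The abstract's two central mechanisms, linear-time O(T) bidirectional state-space conditioning and AdaLN modulation, can be illustrated with a toy sketch. This is not the authors' implementation: it uses a fixed diagonal linear SSM rather than Mamba's input-dependent (selective) parameters, it merges the two directions by plain summation where MVC uses a gate, and the names `ssm_scan`, `bidirectional_scan`, and `adaln` are hypothetical.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Toy diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = c . h_t.

    x is a 1-D sequence of length T; a, b, c are 1-D arrays of state size N.
    One pass over the sequence with a fixed-size state gives O(T) time
    and O(N) recurrent memory, independent of T.
    """
    h = np.zeros_like(a)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t   # state update
        y[t] = c @ h          # readout
    return y

def bidirectional_scan(x, a, b, c):
    """Sum of a forward scan and a time-reversed scan.

    Assumption: plain summation stands in for the gated merge
    that the abstract mentions but does not specify.
    """
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]
    return fwd + bwd

def adaln(x, scale, shift, eps=1e-5):
    """Adaptive LayerNorm over the feature axis of x with shape (T, D).

    scale and shift have shape (D,) and would normally be predicted
    from a style embedding; here they are passed in directly.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (1.0 + scale) * (x - mu) / np.sqrt(var + eps) + shift

if __name__ == "__main__":
    x = np.array([1.0, 0.0, 0.0])
    a, b, c = np.array([0.5]), np.array([1.0]), np.array([1.0])
    print(ssm_scan(x, a, b, c))            # impulse response decays by a each step
    print(bidirectional_scan(x, a, b, c))
```

Because the backward scan needs future frames, a streaming variant would have to restrict it to a finite look-ahead window, which is the trade-off the abstract refers to as finite look-ahead streaming.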