T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS

Haibin Wu; Bach Viet Do; Naveen Suda; Julian Chan; Madhavan C R; Gene-Ping Yang; Yi-Chiao Wu; Naoyuki Kanda; Yossef Adi; Xin Lei; Yue Liu; Florian Metze; Yuzong Liu

arXiv:2601.20094·eess.AS·January 29, 2026

TL;DR

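This paper introduces T-Mimi, a transformer-only replacement for the Mimi codec decoder, aimed at real-time on-phone TTS. It cuts on-device latency from 42.1 ms to 4.4 ms and preserves audio quality under quantization by keeping the most quantization-sensitive layers in full precision.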

Contribution

T-Mimi replaces convolutional components with a transformer-based decoder, achieving over 9x latency reduction on edge devices while preserving audio quality.
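The "over 9x" figure can be checked directly from the two latencies reported on this page (42.1 ms for the original decoder, 4.4 ms for T-Mimi). The short sketch below only redoes that arithmetic; the variable names are illustrative and are not taken from the paper.

```python
# Latencies as reported on this page, in milliseconds.
baseline_ms = 42.1   # original Mimi decoder
t_mimi_ms = 4.4      # T-Mimi decoder

# Speedup factor and relative latency reduction.
speedup = baseline_ms / t_mimi_ms
reduction_pct = 100.0 * (baseline_ms - t_mimi_ms) / baseline_ms

print(f"speedup: {speedup:.2f}x")            # speedup: 9.57x
print(f"reduction: {reduction_pct:.1f}%")    # reduction: 89.5%
```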

Findings

01. Latency reduced from 42.1 ms to 4.4 ms on mobile devices.

02. Quantization sensitivity identified in the last transformer and linear layers.

03. Full precision needed for certain layers to maintain audio quality.
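To make the mixed-precision idea in the findings concrete, here is a minimal sketch of selective quantization: most layers pass through a symmetric fake-quantization step, while a named set of sensitive layers is left untouched. Everything specific here is an assumption for illustration only: the layer names, the toy weights, the 4-bit width, and the per-tensor symmetric scheme are hypothetical, and the sketch shows only the rounding step, not the quantization-aware training the paper performs.

```python
from typing import Dict, List, Set


def fake_quantize(weights: List[float], bits: int = 8) -> List[float]:
    """Symmetric per-tensor fake quantization: snap to an integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights)  # nothing to scale
    scale = max_abs / qmax
    # Round to the nearest grid point and clamp to the representable range.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def quantize_decoder(
    layers: Dict[str, List[float]],
    keep_full_precision: Set[str],
    bits: int = 8,
) -> Dict[str, List[float]]:
    """Quantize every layer except those listed in keep_full_precision."""
    return {
        name: list(w) if name in keep_full_precision else fake_quantize(w, bits)
        for name, w in layers.items()
    }


# Hypothetical layer names and toy weights, not the real T-Mimi parameters.
layers = {
    "transformer.0": [0.5, -1.0, 0.26],
    "transformer.last": [0.1234, -0.5678],
    "output_linear": [0.3, 0.7001],
}
# Assumption: the sensitive layers are the last transformer layer and the final linear layer.
sensitive = {"transformer.last", "output_linear"}

mixed = quantize_decoder(layers, sensitive, bits=4)
for name, w in mixed.items():
    print(name, w)
```

In this toy run the early layer is rounded onto a coarse grid while the two layers marked sensitive come back bit-identical, which is the behavior a per-layer precision policy is meant to provide.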

Abstract

Neural audio codecs provide promising acoustic features for speech synthesis, and representative streaming codecs such as Mimi deliver high-quality acoustic features for real-time Text-to-Speech (TTS) applications. However, Mimi's decoder, which employs a hybrid transformer and convolution architecture, introduces significant latency bottlenecks on edge devices because its deconvolution layers are compute-intensive and poorly suited to mobile-CPU inference frameworks such as XNNPACK, the most representative of them. This paper introduces T-Mimi, a novel modification of the Mimi codec decoder that replaces its convolutional components with a purely transformer-based decoder, inspired by the TS3-Codec architecture. This change dramatically reduces on-device TTS latency from 42.1 ms to just 4.4 ms. Furthermore, we conduct quantization-aware training and derive a crucial finding: the final two…

Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques · Speech and Audio Processing