MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

Shengpeng Ji; Ziyue Jiang; Hanting Wang; Jialong Zuo; Zhou Zhao

arXiv:2402.09378·eess.AS·June 4, 2024·1 cites

TL;DR

MobileSpeech is a fast, lightweight, and robust zero-shot TTS framework designed for mobile devices. It reports a real-time factor (RTF) of 0.09 on an A100 GPU and has been deployed on mobile devices.

Contribution

The paper introduces a mobile-friendly zero-shot TTS system built on a parallel speech mask decoder (SMD) and probabilistic masking, enabling real-time, high-quality speech synthesis on mobile devices.

Findings

01 Achieves RTF of 0.09 on A100 GPU

02 Demonstrates effective multilingual speech synthesis

03 Successfully deployed on mobile devices
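
For readers unfamiliar with the metric in the first finding: the real-time factor (RTF) is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below only illustrates that arithmetic. The function name `real_time_factor` and the timing numbers are hypothetical, picked so the ratio matches the reported 0.09; they are not measurements from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return compute time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings (not from the paper): 0.9 s of compute for 10 s of speech.
rtf = real_time_factor(0.9, 10.0)
print(f"RTF = {rtf:.2f}")
print(f"Speed-up over real time = {1 / rtf:.1f}x")
```

Under this definition, an RTF of 0.09 corresponds to generating speech roughly 11 times faster than it takes to play back.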

Abstract

Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, model size, and robustness. Therefore, we propose MobileSpeech, which is a fast, lightweight, and robust zero-shot text-to-speech system based on mobile devices for the first time. Specifically: 1) leveraging discrete codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weight mechanisms across different codec layers during the generation process. Moreover, to bridge the gap between text and speech, we introduce a…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
