Compact Neural TTS Voices for Accessibility
Kunal Jain, Eoin Murphy, Deepanshu Gupta, Jonathan Dyke, Saumya Shah, Vasilieios Tsiaras, Petko Petkov, Alistair Conkie

TL;DR
This paper presents a compact neural TTS system that achieves high-quality speech synthesis with very low latency and a small disk footprint, making it suitable for deployment on low-power devices for accessibility.
Contribution
A novel compact neural TTS model that balances naturalness, latency, and size, enabling real-time speech synthesis on resource-constrained devices.
Findings
Achieves approximately 15 ms latency.
Maintains high speech naturalness.
Has a low disk footprint suitable for handheld devices.
Abstract
Contemporary text-to-speech solutions for accessibility applications can typically be classified into two categories: (i) device-based statistical parametric speech synthesis (SPSS) or unit selection (USEL) and (ii) cloud-based neural TTS. SPSS and USEL offer low latency and low disk footprint at the expense of naturalness and audio quality. Cloud-based neural TTS systems provide significantly better audio quality and naturalness but regress in terms of latency and responsiveness, rendering these impractical for real-world applications. More recently, neural TTS models were made deployable to run on handheld devices. Nevertheless, latency remains higher than SPSS and USEL, while disk footprint prohibits pre-installation for multiple voices at once. In this work, we describe a high-quality compact neural TTS system achieving latency on the order of 15 ms with low disk footprint. The…
Taxonomy
Topics: Tactile and Sensory Interactions
