TL;DR
ParaStyleTTS is a lightweight, interpretable TTS system that controls expressive speaking style from text prompts alone. It produces high-quality speech robustly and efficiently enough for real-time use, and it outperforms LLM-based methods in speed and resource usage.
Contribution
The paper introduces a novel two-level style adaptation architecture for controllable, robust, and efficient expressive TTS driven by text prompts, with no reliance on reference audio or large language models.
Findings
Generates high-quality speech comparable to state-of-the-art LLM-based systems.
Runs 30x faster than LLM-based systems, with 8x fewer parameters and lower memory use.
Shows superior robustness and more reliable control over paralinguistic styles.
Abstract
Controlling speaking style in text-to-speech (TTS) systems has become a growing focus in both academia and industry. While many existing approaches rely on reference audio to guide style generation, such methods are often impractical due to privacy concerns and limited accessibility. More recently, large language models (LLMs) have been used to control speaking style through natural language prompts; however, their high computational cost, lack of interpretability, and sensitivity to prompt phrasing limit their applicability in real-time and resource-constrained environments. In this work, we propose ParaStyleTTS, a lightweight and interpretable TTS framework that enables expressive style control from text prompts alone. ParaStyleTTS features a novel two-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling. It allows fine-grained and…
