Interactive Text-to-Speech System via Joint Style Analysis
Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Kohler, Christian Fuegen, Qing He

TL;DR
This paper introduces a style-embedded TTS system that generates speech responses matching the style of input queries by jointly training style extraction and TTS models using semi-supervised learning, improving naturalness and style consistency.
Contribution
The paper presents a novel semi-supervised approach for joint style extraction and TTS training, enabling style-aware speech synthesis with limited labeled data.
Findings
Users preferred styled TTS responses in subjective tests.
The system successfully mimics speech query styles during inference.
Joint training improves style consistency in generated speech.
Abstract
While modern TTS technologies have made significant advancements in audio quality, there is still a lack of behavioral naturalness compared to conversing with people. We propose a style-embedded TTS system that generates styled responses based on the speech query style. To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which is then used by the TTS to produce a matching response. We faced two main challenges: 1) Only a small portion of the TTS training dataset has style labels, which are needed to train a multi-style TTS that respects different style embeddings during inference. 2) The TTS system and the style extraction model have disjoint training datasets. We need consistent style labels across these two datasets so that the TTS can learn to respect the labels produced by the style extraction model during inference. To…
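The inference flow described in the abstract (speech query, then style embedding, then styled response) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `extract_style`, `synthesize`, `respond`, and `STYLE_DIM` are hypothetical, and the hand-written signal statistics and per-character "frames" are toy stand-ins for the learned style extraction model and TTS model, not the authors' implementation.

```python
from typing import List, Sequence

STYLE_DIM = 4  # hypothetical embedding size, chosen only for this sketch


def extract_style(query_audio: Sequence[float]) -> List[float]:
    """Toy stand-in for a style extraction model.

    Maps a waveform to a fixed-size vector using simple signal statistics.
    A real system would use a trained neural encoder instead.
    """
    n = len(query_audio)
    if n == 0:
        return [0.0] * STYLE_DIM
    mean = sum(query_audio) / n
    energy = sum(x * x for x in query_audio) / n
    peak = max(abs(x) for x in query_audio)
    # Fraction of adjacent sample pairs whose sign flips.
    crossings = sum(
        1 for a, b in zip(query_audio, query_audio[1:]) if a * b < 0
    ) / n
    return [mean, energy, peak, crossings]


def synthesize(text: str, style: Sequence[float]) -> List[List[float]]:
    """Toy stand-in for a style-conditioned TTS model.

    Emits one "frame" per character: a text-derived value followed by the
    style vector, mimicking how a style embedding conditions every step.
    """
    frames = []
    for ch in text:
        text_feature = (ord(ch) % 32) / 32.0
        frames.append([text_feature] + list(style))
    return frames


def respond(query_audio: Sequence[float], response_text: str) -> List[List[float]]:
    """Extract the query's style, then synthesize the response in that style."""
    style = extract_style(query_audio)
    return synthesize(response_text, style)
```

Calling `respond(query_audio, "hello")` with two different queries yields frames that share the same text-derived values but carry different style components, which is the behavior the abstract describes at a high level. The training procedure is not sketched here because the abstract is truncated before it describes one.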
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
