AS-Speech: Adaptive Style For Speech Synthesis
Zhipeng Li, Xiaofen Xing, Jun Wang, Shuaiqi Chen, Guoqiao Yu, Guanglu Wan, Xiangmin Xu

TL;DR
AS-Speech introduces a unified adaptive style framework for TTS that combines fine-grained timbre and rhythm features, resulting in more natural and speaker-similar synthesized speech.
Contribution
The paper presents a novel adaptive style method integrating timbre and rhythm into a single model for improved speech synthesis.
Findings
Produces more natural speech with higher fidelity.
Achieves better speaker similarity in style.
Outperforms existing adaptive TTS models.
Abstract
In recent years, there has been significant progress in Text-to-Speech (TTS) synthesis technology, enabling the high-quality synthesis of voices in common scenarios. In unseen situations, adaptive TTS requires a strong generalization capability to speaker style characteristics. However, the existing adaptive methods can only extract and integrate coarse-grained timbre or mixed rhythm attributes separately. In this paper, we propose AS-Speech, an adaptive style methodology that integrates the speaker timbre characteristics and rhythmic attributes into a unified framework for text-to-speech synthesis. Specifically, AS-Speech can accurately simulate style characteristics through fine-grained text-based timbre features and global rhythm information, and achieve high-fidelity speech synthesis through the diffusion model. Experiments show that the proposed model produces voices with higher…
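The abstract names two style signals: fine-grained, text-based timbre features and a global rhythm representation. The following is a minimal sketch of one way such signals could be combined into a conditioning sequence. It is not the authors' implementation: the use of scaled dot-product attention for timbre, mean pooling for rhythm, additive fusion, and every function name and tensor shape here are assumptions made for illustration. The diffusion decoder that would consume the conditioning is not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_based_timbre(text_hidden, ref_frames):
    """Assumed mechanism: each text position attends over reference frames.

    text_hidden: (T_text, d) text encoder states (queries)
    ref_frames:  (T_ref, d) reference speech features (keys and values)
    Returns one timbre vector per text position, plus the attention weights.
    """
    d = text_hidden.shape[-1]
    scores = text_hidden @ ref_frames.T / np.sqrt(d)  # (T_text, T_ref)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ ref_frames, weights              # (T_text, d)


def global_rhythm(ref_rhythm_frames):
    """Assumed mechanism: mean-pool rhythm features into one global vector."""
    return ref_rhythm_frames.mean(axis=0)             # (d,)


def style_condition(text_hidden, ref_frames, ref_rhythm_frames):
    """Additively fuse text states, per-token timbre, and global rhythm."""
    timbre, _ = text_based_timbre(text_hidden, ref_frames)
    rhythm = global_rhythm(ref_rhythm_frames)
    # The single rhythm vector is broadcast across all text positions.
    return text_hidden + timbre + rhythm[None, :]


# Toy inputs: 5 text tokens, 20 reference frames, feature size 8.
rng = np.random.default_rng(0)
text_hidden = rng.normal(size=(5, 8))
ref_frames = rng.normal(size=(20, 8))
ref_rhythm = rng.normal(size=(20, 8))

cond = style_condition(text_hidden, ref_frames, ref_rhythm)
timbre, weights = text_based_timbre(text_hidden, ref_frames)
print(cond.shape)  # (5, 8)
```

In this sketch the timbre term varies per text token while the rhythm term is shared across the utterance, which mirrors the fine-grained versus global distinction the abstract draws.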
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Diffusion
