DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation

Ming Meng; Ziyi Yang; Jian Yang; Zhenjie Su; Yonggui Zhu; Zhaoxin Fan

arXiv:2506.01020·cs.SD·June 3, 2025

TL;DR

This paper introduces DS-TTS, a novel zero-shot TTS system that uses dual-style encoding and dynamic adaptation to synthesize natural, expressive speech for unseen speakers from minimal voice samples.

Contribution

We propose DS-TTS with a Dual-Style Encoding Network and Style Gating-Film mechanism, advancing zero-shot voice cloning by better capturing speaker style and handling variable sentence lengths.
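The listing does not spell out how the Style Gating-Film mechanism works. As a rough illustration only, the sketch below shows generic FiLM-style feature modulation (a per-channel scale and shift predicted from a style vector) combined with a sigmoid gate that blends the modulated and original features. The function `gated_film`, the parameter names, and the blending rule are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gated_film(features, style, params):
    """Hypothetical gated FiLM layer (not taken from the paper).

    features: list of frames, each a list of C channel values.
    style:    style vector of length S.
    params:   dict of C x S weight matrices and length-C biases.
    """
    gamma = linear(params["w_gamma"], params["b_gamma"], style)  # scale
    beta = linear(params["w_beta"], params["b_beta"], style)     # shift
    gate = [sigmoid(g)
            for g in linear(params["w_gate"], params["b_gate"], style)]
    out = []
    for frame in features:
        # FiLM: per-channel affine transform conditioned on the style.
        modulated = [g * x + b for g, x, b in zip(gamma, frame, beta)]
        # Gate: blend modulated and original features per channel.
        out.append([k * m + (1.0 - k) * x
                    for k, m, x in zip(gate, modulated, frame)])
    return out
```

In this sketch the scale, shift, and gate depend only on the style vector, so the same modulation applies to an utterance with any number of frames.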

Findings

01. Outperforms existing models in word error rate and speaker similarity

02. Demonstrates robust generalization to unseen speakers

03. Improves naturalness and expressiveness of synthesized speech
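For readers unfamiliar with the two metrics named in the first finding, the sketch below shows how they are conventionally computed. Word error rate is the word-level edit distance between a reference transcript and a recognizer's transcript of the synthesized audio, divided by the reference length. Speaker similarity is often reported as the cosine similarity between speaker embeddings, although this listing does not say which definition the paper uses, so treat that part as an assumption. The function names are illustrative.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Lower word error rate indicates more intelligible speech, while higher cosine similarity indicates a closer match to the target speaker.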

Abstract

Recent advancements in text-to-speech (TTS) technology have increased demand for personalized audio synthesis. Zero-shot voice cloning, a specialized TTS task, aims to synthesize a target speaker's voice using only a single audio sample and arbitrary text, without prior exposure to the speaker during training. This process employs pattern recognition techniques to analyze and replicate the speaker's unique vocal features. Despite progress, challenges remain in adapting to the vocal style of unseen speakers, highlighting difficulties in generalizing TTS systems to handle diverse voices while maintaining naturalness, expressiveness, and speaker fidelity. To address the challenges of unseen speaker style adaptation, we propose DS-TTS, a novel approach aimed at enhancing the synthesis of diverse, previously unheard voices. Central to our method is a Dual-Style Encoding Network (DuSEN),…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques