ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis
Haitao Li, Chunxiang Jin, Chenglin Li, Wenhao Guan, Zhengxing Huang, Xie Chen

TL;DR
ReStyle-TTS is a zero-shot speech synthesis framework that gives continuous, reference-relative control over multiple style attributes while preserving speech quality and speaker identity.
Contribution
ReStyle-TTS proposes Decoupled Classifier-Free Guidance and style-specific LoRAs for disentangled, continuous style control in zero-shot TTS, addressing prior methods' reliance on absolute style targets and discrete textual prompts.
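To make the idea of decoupling guidance concrete, here is a minimal sketch of classifier-free guidance with separate weights for the text condition and the reference-audio condition. This summary does not give the paper's exact formulation, so the combination rule below is a generic multi-condition form used as an assumption, and the names `decoupled_cfg`, `w_text`, and `w_ref` are hypothetical.

```python
from typing import List, Sequence


def decoupled_cfg(
    uncond: Sequence[float],
    text_only: Sequence[float],
    text_and_ref: Sequence[float],
    w_text: float,
    w_ref: float,
) -> List[float]:
    """Combine three model predictions with independent guidance weights.

    ASSUMPTION: this is a generic two-condition guidance rule, not the
    paper's confirmed equation. Each input is a model output (for example
    a predicted velocity or noise vector) under a different conditioning:
      uncond        no text, no reference audio
      text_only     text, no reference audio
      text_and_ref  text and reference audio
    """
    return [
        u + w_text * (t - u) + w_ref * (tr - t)
        for u, t, tr in zip(uncond, text_only, text_and_ref)
    ]


# w_ref = 0 drops the reference term entirely; w_text = w_ref = 1
# recovers the fully conditioned prediction.
no_ref = decoupled_cfg([0.0, 1.0], [1.0, 1.0], [2.0, 3.0], w_text=2.0, w_ref=0.0)
full = decoupled_cfg([0.0, 1.0], [1.0, 1.0], [2.0, 3.0], w_text=1.0, w_ref=1.0)
```

Keeping the two weights separate is what lets the pull toward the reference be lowered without also weakening adherence to the text.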
Findings
Enables continuous and relative control over pitch, energy, and emotions.
Maintains speaker timbre and intelligibility in style transfer.
Performs robustly with mismatched reference audio.
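One way to picture continuous, relative control with a style-specific LoRA is to scale the low-rank update by a signed strength before adding it to a frozen weight. The sketch below uses the standard LoRA update W + alpha * (B @ A); treating `alpha` as the style knob (positive for more of an attribute, negative for less, zero for the unchanged model) is an assumption about how such control could work, not a detail confirmed by this summary. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_style_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Return weight + alpha * (lora_b @ lora_a).

    ASSUMPTION: alpha acts as a continuous style strength. alpha = 0 leaves
    the base weight untouched, so the output stays anchored to whatever
    the reference audio already implies.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + alpha * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # rank-1 factor, shape 2 x 1
a = [[0.5, -0.5]]    # rank-1 factor, shape 1 x 2

unchanged = apply_style_lora(base, b, a, alpha=0.0)
stronger = apply_style_lora(base, b, a, alpha=2.0)
weaker = apply_style_lora(base, b, a, alpha=-2.0)
```

Because `alpha` is a real number rather than a discrete label, intermediate values give intermediate shifts, which is the sense of "continuous" used in the findings above.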
Abstract
Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to address this issue, they typically rely on absolute style targets and discrete textual prompts, and therefore do not support continuous and reference-relative style control. We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS. Our key insight is that effective style control requires first reducing the model's implicit dependence on reference style before introducing explicit control mechanisms. To this end, we introduce Decoupled…
