DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance

Jinhyeok Yang; Junhyeok Lee; Hyeong-Seok Choi; Seunghun Ji; Hyeongju Kim; Juheon Lee

arXiv:2408.14423·eess.AS·August 28, 2024

Open Access

TL;DR

DualSpeech is a novel TTS model that uses phoneme-level latent diffusion and dual classifier-free guidance to achieve superior control over speaker-fidelity and text-intelligibility, surpassing existing models.

Contribution

It introduces a new TTS framework combining phoneme-level latent diffusion with dual classifier-free guidance for enhanced speech control.

Findings

01. Outperforms state-of-the-art TTS models in quality.

02. Provides exceptional control over speaker and text attributes.

03. Demonstrates significant improvements through experimental evaluation.

Abstract

Text-to-Speech (TTS) models have advanced significantly, aiming to accurately replicate human speech's diversity, including unique speaker identities and linguistic nuances. Despite these advancements, achieving an optimal balance between speaker-fidelity and text-intelligibility remains a challenge, particularly when diverse control demands are considered. Addressing this, we introduce DualSpeech, a TTS model that integrates phoneme-level latent diffusion with dual classifier-free guidance. This approach enables exceptional control over speaker-fidelity and text-intelligibility. Experimental results demonstrate that by utilizing the sophisticated control, DualSpeech surpasses existing state-of-the-art TTS models in performance. Demos are available at https://bit.ly/48Ewoib.
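The abstract does not state how the two guidance terms are combined, so the following is only a minimal sketch of one common way to apply classifier-free guidance with two conditions, each with its own scale. The function name `dual_cfg`, the parameter names, and the additive form itself are assumptions made for illustration, not the formulation used in DualSpeech. The three inputs stand for a denoiser's predictions with no condition, with the text condition only, and with the speaker condition only.

```python
from typing import List, Sequence


def dual_cfg(
    eps_uncond: Sequence[float],
    eps_text: Sequence[float],
    eps_speaker: Sequence[float],
    w_text: float,
    w_speaker: float,
) -> List[float]:
    """Combine three denoiser predictions using two guidance scales.

    Assumed form (illustrative, not taken from the paper):
        eps = eps_uncond
              + w_text    * (eps_text    - eps_uncond)
              + w_speaker * (eps_speaker - eps_uncond)
    """
    if not (len(eps_uncond) == len(eps_text) == len(eps_speaker)):
        raise ValueError("all predictions must have the same length")
    # Each condition pushes the unconditional prediction along its own
    # direction, scaled by its own weight.
    return [
        u + w_text * (t - u) + w_speaker * (s - u)
        for u, t, s in zip(eps_uncond, eps_text, eps_speaker)
    ]


# Toy two-dimensional predictions (made-up numbers).
guided = dual_cfg(
    eps_uncond=[0.0, 1.0],
    eps_text=[1.0, 1.0],
    eps_speaker=[0.0, 3.0],
    w_text=2.0,
    w_speaker=0.5,
)
print(guided)  # [2.0, 2.0]
```

Under this assumed form, raising `w_text` pulls the sample toward the text-conditioned prediction and raising `w_speaker` pulls it toward the speaker-conditioned one, so the two scales can be tuned independently. Setting both to zero recovers the unconditional prediction.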

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Diffusion