Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis

Ye-Xin Lu; Hui-Peng Du; Zheng-Yan Sheng; Yang Ai; Zhen-Hua Ling

arXiv:2412.16977·eess.AS·December 24, 2024·ICASSP

TL;DR

This paper introduces IDEA-TTS, a zero-shot TTS method that disentangles environment, speaker, and text factors so that it can synthesize speech for unseen speakers while preserving the acoustics of a reference environment. The authors report state-of-the-art results on environment conversion.

Contribution

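The paper proposes an incremental disentanglement process for environment-aware zero-shot TTS. An environment estimator first separates the environmental spectrogram into an environment mask and an enhanced spectrogram; environment embeddings are extracted from the mask, and the speaker and text factors are then disentangled from the enhanced spectrogram, conditioned on speaker embeddings.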

Findings

01 Superior speech quality and similarity in environment-aware TTS

02 Effective disentanglement of environment, speaker, and text factors

03 State-of-the-art performance in environment conversion tasks

Abstract

This paper proposes an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method, dubbed IDEA-TTS, that can synthesize speech for unseen speakers while preserving the acoustic characteristics of a given environment reference speech. IDEA-TTS adopts VITS as the TTS backbone. To effectively disentangle the environment, speaker, and text factors, we propose an incremental disentanglement process, where an environment estimator is designed to first decompose the environmental spectrogram into an environment mask and an enhanced spectrogram. The environment mask is then processed by an environment encoder to extract environment embeddings, while the enhanced spectrogram facilitates the subsequent disentanglement of the speaker and text factors with the condition of the speaker embeddings, which are extracted from the environmental speech using a pretrained…
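
The first step of the incremental process described above (splitting an environmental spectrogram into an environment mask and an enhanced spectrogram, then encoding the mask) can be illustrated with a toy sketch. This is not the authors' model: the moving-average "estimator", the additive log-domain decomposition, the mean-pool-plus-projection "encoder", and every name and dimension below are assumptions chosen only to show the data flow.

```python
import numpy as np

def toy_environment_estimator(env_spec, smooth=9):
    """Hypothetical stand-in for the environment estimator.

    Treats the slowly varying part of each frequency bin (a moving
    average over time) as the 'environment mask' and the residual as
    the 'enhanced' spectrogram. Assumes an additive log-domain split.
    """
    kernel = np.ones(smooth) / smooth
    # Smooth each frequency bin (row) along the time axis.
    mask = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, env_spec
    )
    enhanced = env_spec - mask
    return mask, enhanced

def toy_environment_encoder(mask, proj):
    """Hypothetical encoder: mean-pool over time, then project linearly."""
    return proj @ mask.mean(axis=1)

rng = np.random.default_rng(0)
n_mels, n_frames, emb_dim = 80, 120, 16         # illustrative sizes only
env_spec = rng.normal(size=(n_mels, n_frames))  # fake log-mel spectrogram
proj = rng.normal(size=(emb_dim, n_mels))       # random projection, untrained

mask, enhanced = toy_environment_estimator(env_spec)
env_embedding = toy_environment_encoder(mask, proj)

print(mask.shape, enhanced.shape, env_embedding.shape)
```

In this sketch the enhanced spectrogram would be what the later speaker and text disentanglement stages consume, while the environment embedding carries the acoustic conditions separately; in the paper both components are learned networks rather than fixed filters.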

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing