Environment Aware Text-to-Speech Synthesis
Daxin Tan, Guangyan Zhang, Tan Lee

TL;DR
This paper introduces an environment-aware TTS system that models and incorporates acoustic environment factors to generate speech matching specific speaker and environment characteristics, leveraging heterogeneous speech data.
Contribution
It presents a novel neural network approach that disentangles speaker and environment factors in speech, enabling environment-aware speech synthesis from diverse data sources.
Findings
Effective disentanglement of speaker and environment factors.
Ability to synthesize speech with specified speaker and environment attributes.
Objective and subjective evaluations indicate gains in speech quality and attribute control.
Abstract
This study aims at designing an environment-aware text-to-speech (TTS) system that can generate speech to suit specific acoustic environments. It is also motivated by the desire to leverage massive data of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment in speech audio as a factor of data variability and incorporate it as a condition in the process of neural network based speech synthesis. Two embedding extractors are trained with two purposely constructed datasets for characterization and disentanglement of speaker and environment factors in speech. A neural network model is trained to generate speech from extracted speaker and environment embeddings. Objective and subjective evaluation results demonstrate that the proposed TTS system is able to effectively disentangle speaker and environment factors and synthesize…
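The conditioning idea in the abstract can be illustrated with a toy sketch. It assumes that each extractor yields one fixed-size vector per utterance and that both vectors are concatenated onto every text-encoder state before decoding. The abstract does not say how the embeddings are injected, so the concatenation, the mean-pooling extractor, and all names and dimensions here (`embed`, `condition`, `spk_proj`, `env_proj`) are hypothetical stand-ins, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(mel, proj):
    """Toy utterance-level extractor: mean-pool frames, project, L2-normalise."""
    pooled = mel.mean(axis=0)              # (n_mels,)
    vec = proj @ pooled                    # (emb_dim,)
    return vec / (np.linalg.norm(vec) + 1e-8)

def condition(text_hidden, spk_emb, env_emb):
    """Repeat both embeddings over time and append them to each encoder state."""
    n_steps = text_hidden.shape[0]
    cond = np.concatenate([spk_emb, env_emb])          # (2 * emb_dim,)
    return np.concatenate([text_hidden, np.tile(cond, (n_steps, 1))], axis=1)

# Assumed sizes, chosen only for illustration.
n_mels, emb_dim, hid = 80, 16, 32
spk_proj = rng.standard_normal((emb_dim, n_mels))   # stands in for the speaker extractor
env_proj = rng.standard_normal((emb_dim, n_mels))   # stands in for the environment extractor

ref_speaker = rng.standard_normal((120, n_mels))      # reference audio supplying the voice
ref_environment = rng.standard_normal((200, n_mels))  # reference audio supplying the environment
text_hidden = rng.standard_normal((25, hid))          # text-encoder output, 25 steps

spk_emb = embed(ref_speaker, spk_proj)
env_emb = embed(ref_environment, env_proj)
decoder_input = condition(text_hidden, spk_emb, env_emb)
print(decoder_input.shape)
```

Because the two embeddings come from separate reference utterances, a voice from one recording can be paired with the acoustic environment of another, which is the kind of control the abstract describes.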
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
