SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate

Nabarun Goswami; Tatsuya Harada

arXiv:2207.06011 · eess.AS · July 14, 2022 · 1 citation

Open Access

TL;DR

SATTS is a TTS method that uses speaker attractors for zero-shot speaker adaptation, enabling natural speech synthesis for unseen speakers even when the reference recording is reverberant or mixed with other speakers.

Contribution

This work explores speaker attractors, learned for speech separation, as the speaker representation for zero-shot adaptation in multi-speaker TTS.

Findings

01. SATTS can synthesize natural speech for unseen speakers.

02. Synthesis remains effective even with reverberant or mixed reference signals.

03. Speaker attractors improve speaker adaptation in TTS systems.

Abstract

The mapping of text to speech (TTS) is non-deterministic: letters may be pronounced differently based on context, and phonemes can vary with physiological and stylistic factors such as gender, age, accent, and emotion. Neural speaker embeddings, trained to identify or verify speakers, are typically used to represent and transfer such characteristics from reference speech to synthesized speech. Speech separation, on the other hand, is the challenging task of separating individual speakers from an overlapping mixture of several speakers. Speaker attractors are high-dimensional embedding vectors that pull the time-frequency bins of each speaker's speech towards themselves while repelling those belonging to other speakers. In this work, we explore the possibility of using these powerful speaker attractors for zero-shot speaker adaptation in multi-speaker TTS synthesis and…
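The attractor idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it follows the general deep-attractor recipe from the speech separation literature, assuming each time-frequency bin already has an embedding and a known speaker label. The function names `speaker_attractors` and `soft_masks`, the 4-dimensional embeddings, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def speaker_attractors(embeddings, assignments):
    """Average the embeddings of each speaker's bins (assumed recipe).

    embeddings:  (N, D) array, one embedding per time-frequency bin.
    assignments: (N, S) one-hot array marking the dominant speaker per bin.
    Returns an (S, D) array with one attractor per speaker.
    """
    counts = assignments.sum(axis=0)[:, None]          # (S, 1) bins per speaker
    return (assignments.T @ embeddings) / np.maximum(counts, 1e-8)


def soft_masks(embeddings, attractors):
    """Softmax over bin-to-attractor similarity: bins are 'pulled' to the
    closest attractor and 'repelled' from the others."""
    scores = embeddings @ attractors.T                 # (N, S) dot products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)


# Synthetic toy data (assumption): two speakers whose bins cluster around
# two distinct directions in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0, 0.0, 0.0],
                    [0.0, 3.0, 0.0, 0.0]])
labels = rng.integers(0, 2, size=200)
embeddings = centers[labels] + 0.1 * rng.standard_normal((200, 4))
assignments = np.eye(2)[labels]

attractors = speaker_attractors(embeddings, assignments)
masks = soft_masks(embeddings, attractors)
```

In this toy setting each row of `masks` is a distribution over speakers, and the largest entry falls on the bin's own speaker. In a TTS setting, an attractor vector of this kind would presumably play the role that a speaker-verification embedding usually plays as the conditioning signal.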

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing