DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

Sen Liu; Yiwei Guo; Chenpeng Du; Xie Chen; Kai Yu

arXiv:2306.14145 · cs.SD · June 27, 2023

TL;DR

This paper introduces DSE-TTS, a dual speaker embedding framework for cross-lingual TTS that improves speaker similarity and nativeness by leveraging distinct embeddings for style and timbre.

Contribution

It proposes a novel dual embedding approach that separates linguistic style and speaker timbre, enhancing cross-lingual speech synthesis quality.

Findings

1. Outperforms SANE-TTS in cross-lingual synthesis

2. Improves nativeness and speaker similarity

3. Uses VQ acoustic features, which contain less speaker information than mel-spectrograms

Abstract

Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain the speaker's timbre (i.e., speaker similarity) and eliminate the accent carried over from the speaker's first language (i.e., nativeness). In this paper, we demonstrate that vector-quantized (VQ) acoustic features contain less speaker information than mel-spectrograms. Based on this finding, we propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style. Here, one embedding is fed to the acoustic model to learn the linguistic speaking style, while the other is integrated into the vocoder to mimic the target speaker's timbre. Experiments show that by combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis, especially in…
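The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the functions acoustic_model, vocoder and synthesize, the two lookup tables, the speaker names, and all constants are hypothetical stand-ins, and the way the two speakers are chosen at inference (a native speaker of the text language for style, the target speaker for timbre) is an assumption made for illustration. The only point it shows is that the acoustic model sees one speaker embedding and the vocoder sees a separate one, with VQ indices as the intermediate feature.

```python
import random

EMB_DIM = 4        # hypothetical embedding size
CODEBOOK_SIZE = 8  # hypothetical VQ codebook size
HOP = 3            # hypothetical number of waveform samples per VQ frame


def make_speaker_table(speakers, seed):
    """Build a toy lookup table: speaker name -> fixed random embedding."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-1, 1) for _ in range(EMB_DIM)] for s in speakers}


SPEAKERS = ["native_en", "native_zh"]
# Two separate tables: one used only by the acoustic model, one only by the vocoder.
style_table = make_speaker_table(SPEAKERS, seed=0)
timbre_table = make_speaker_table(SPEAKERS, seed=1)


def acoustic_model(phonemes, style_emb):
    """Toy stand-in: map each phoneme, conditioned on the style embedding,
    to one VQ codebook index."""
    bias = int(10 * abs(sum(style_emb)))
    return [(sum(ord(c) for c in p) + bias) % CODEBOOK_SIZE for p in phonemes]


def vocoder(vq_indices, timbre_emb):
    """Toy stand-in: turn VQ indices into samples, scaled by the timbre embedding."""
    gain = 1.0 + timbre_emb[0]
    wave = []
    for idx in vq_indices:
        wave.extend([gain * (idx + 1) / CODEBOOK_SIZE] * HOP)
    return wave


def synthesize(phonemes, style_speaker, timbre_speaker):
    """Route one embedding to the acoustic model and the other to the vocoder."""
    vq = acoustic_model(phonemes, style_table[style_speaker])
    wave = vocoder(vq, timbre_table[timbre_speaker])
    return vq, wave


# Cross-lingual case: English text, English speaking style, Chinese speaker's timbre.
phonemes = ["HH", "AH", "L", "OW"]
vq_cross, wave_cross = synthesize(phonemes, "native_en", "native_zh")
# Intralingual case: same style speaker, different timbre speaker.
vq_intra, wave_intra = synthesize(phonemes, "native_en", "native_en")
```

In this sketch, swapping the timbre speaker leaves the VQ indices untouched and changes only the vocoder output, which mirrors the separation of speaking style and timbre that the abstract describes.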

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling