VoiceLens: Controllable Speaker Generation and Editing with Flow

Yao Shi; Ming Li

arXiv:2309.14094·cs.SD·September 26, 2023

Open Access

TL;DR

VoiceLens introduces a flow-based method for controllable speaker generation and editing in speech synthesis, enabling flexible attribute manipulation and noise reduction without retraining TTS models.

Contribution

It presents a semi-supervised flow-based approach that models speaker embeddings for improved controllability and attribute editing in multi-speaker speech synthesis.

Findings

01 Comparable to Tacospawn in unconditional generation

02 Higher controllability and flexibility in conditional generation

03 Effective noise reduction via embedding editing

Abstract

Many current multi-speaker speech synthesis and voice conversion systems represent speaker variation with an embedding vector. Modeling the distribution of this embedding directly allows new voices outside the training data to be synthesized. GMM-based approaches such as Tacospawn are favored in the literature for this generation task, but they remain limited when difficult conditioning is involved. In this paper, we propose VoiceLens, a semi-supervised flow-based approach that models speaker embedding distributions for multi-conditional speaker generation. VoiceLens maps speaker embeddings into a combination of independent attributes and residual information. This allows new voices associated with certain attributes to be generated for existing TTS models, and attributes of known voices to be meaningfully edited. We show that VoiceLens displays an unconditional generation capacity…
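The mapping described in the abstract (embedding to independent attributes plus residual, and back) can be illustrated with a toy invertible transform. The sketch below is not the VoiceLens model: it assumes a single hand-written affine coupling step in place of a trained flow, and the names `N_ATTR`, `encode`, `decode`, `edit_attribute`, and `generate`, as well as the latent layout, are hypothetical choices made for illustration only.

```python
import math
import random

N_ATTR = 2  # hypothetical: the first two latent dims hold labeled attributes


def _scale_shift(residual):
    # Fixed toy conditioner (assumption); a trained flow would learn this.
    s = math.tanh(sum(residual))
    t = 0.5 * sum(residual)
    return s, t


def encode(embedding):
    """Speaker embedding -> latent laid out as [attributes | residual]."""
    head, residual = embedding[:N_ATTR], embedding[N_ATTR:]
    s, t = _scale_shift(residual)
    attrs = [x * math.exp(s) + t for x in head]
    return attrs + list(residual)


def decode(latent):
    """Exact inverse of encode: latent -> speaker embedding."""
    attrs, residual = latent[:N_ATTR], latent[N_ATTR:]
    s, t = _scale_shift(residual)
    head = [(z - t) * math.exp(-s) for z in attrs]
    return head + list(residual)


def edit_attribute(embedding, index, value):
    """Overwrite one attribute in latent space, keep the residual, map back."""
    latent = encode(embedding)
    latent[index] = value
    return decode(latent)


def generate(attributes, residual_dim, seed=None):
    """Sample a standard-normal residual and decode it with chosen attributes."""
    rng = random.Random(seed)
    residual = [rng.gauss(0.0, 1.0) for _ in range(residual_dim)]
    return decode(list(attributes) + residual)
```

Because the transform is invertible, editing changes only the targeted attribute in latent space and leaves the residual untouched, while generation fixes the attributes and samples the rest. Either way the result is an ordinary embedding vector, which is consistent with the abstract's point that it can be passed to an existing TTS model.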

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques