Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models

Aya Watanabe; Shinnosuke Takamichi; Yuki Saito; Detai Xin; Hiroshi Saruwatari

arXiv:2210.09916·cs.SD·October 19, 2022

Open Access

TL;DR

This paper introduces an optimal-transport-based interpolation method for Gaussian mixture models to generate synthetic speakers with intermediate attributes, enhancing voice diversity and controllability in speaker synthesis.

Contribution

It presents a novel interpolation technique for GMMs in speaker generation, enabling the creation of voices with intermediate attributes like gender and language fluency.

Findings

1. Effective control of speaker attributes via continuous scalar values.

2. Generated speech maintains naturalness without significant degradation.

3. Method successfully produces mid-attribute, diverse speaker voices.

Abstract

In this paper, we propose a method for intermediating multiple speakers' attributes and diversifying their voice characteristics in "speaker generation," an emerging task that aims to synthesize a nonexistent speaker's natural-sounding voice. The conventional TacoSpawn-based speaker generation method represents the distributions of speaker embeddings by Gaussian mixture models (GMMs) conditioned on speaker attributes. Although this method enables the sampling of various speakers from the speaker-attribute-aware GMMs, it is not yet clear whether the learned distributions can represent speakers with an intermediate attribute (i.e., mid-attribute). To this end, we propose an optimal-transport-based method that interpolates the learned GMMs to generate nonexistent speakers with mid-attribute (e.g., gender-neutral) voices. We empirically validate our method and evaluate the naturalness…
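
To make the idea of interpolating two attribute-conditioned GMMs more concrete, the following is a minimal sketch of one common way to do it with optimal transport. It is not taken from the paper and may differ from the authors' formulation. It assumes diagonal-covariance Gaussians, uses the closed-form squared 2-Wasserstein distance between Gaussians as the cost between mixture components, solves the discrete transport problem over component weights with a linear program, and moves each matched pair of components along its displacement interpolation. All function names (w2_squared_diag, transport_plan, interpolate_gmm, sample_embeddings) and the scalar t in [0, 1] are hypothetical and chosen for illustration only.

```python
import numpy as np
from scipy.optimize import linprog


def w2_squared_diag(m0, s0, m1, s1):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    m0, m1 are mean vectors; s0, s1 are per-dimension standard deviations.
    """
    return np.sum((m0 - m1) ** 2) + np.sum((s0 - s1) ** 2)


def transport_plan(w0, w1, cost):
    """Solve the discrete OT problem between two weight vectors (assumed LP form)."""
    k0, k1 = cost.shape
    # Row sums of the plan must equal w0; column sums must equal w1.
    a_rows = np.kron(np.eye(k0), np.ones((1, k1)))
    a_cols = np.kron(np.ones((1, k0)), np.eye(k1))
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([a_rows, a_cols]),
        b_eq=np.concatenate([w0, w1]),
        bounds=(0, None),
        method="highs",
    )
    return res.x.reshape(k0, k1)


def interpolate_gmm(w0, m0, s0, w1, m1, s1, t):
    """Return weights, means, and stds of the GMM interpolated at t in [0, 1].

    t = 0 recovers the first GMM and t = 1 the second (illustrative convention).
    """
    cost = np.array(
        [[w2_squared_diag(m0[k], s0[k], m1[l], s1[l]) for l in range(len(w1))]
         for k in range(len(w0))]
    )
    plan = transport_plan(w0, w1, cost)
    # Keep only component pairs that actually carry mass.
    ks, ls = np.nonzero(plan > 1e-12)
    weights = plan[ks, ls]
    weights = weights / weights.sum()
    # Displacement interpolation of each matched pair of diagonal Gaussians.
    means = (1 - t) * m0[ks] + t * m1[ls]
    stds = (1 - t) * s0[ks] + t * s1[ls]
    return weights, means, stds


def sample_embeddings(weights, means, stds, n, seed=0):
    """Draw n vectors from the interpolated diagonal-covariance GMM."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n, p=weights)
    return means[comp] + stds[comp] * rng.standard_normal((n, means.shape[1]))
```

In this sketch, calling interpolate_gmm with t = 0.5 on two GMMs learned for different attribute values would give a mixture located midway between them in the transport sense, and sample_embeddings would then draw candidate mid-attribute speaker embeddings from it. Whether the paper uses diagonal covariances or this exact transport cost is an assumption here.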


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing