DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis
Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

TL;DR
This paper introduces novel DNN-based speaker embedding algorithms that utilize subjective inter-speaker similarity scores, enhancing multi-speaker speech synthesis, especially for unseen speakers, by aligning embeddings with human perception.
Contribution
It proposes two new training algorithms for speaker embedding using subjective similarity data, improving correlation with human perception and speech synthesis quality.
Findings
The algorithms produce speaker embeddings highly correlated with subjective similarity.
Similarity vector embedding improves speech quality for unseen speakers.
Crowdsourced similarity scores effectively guide embedding training.
Abstract
This paper proposes novel algorithms for speaker embedding using subjective inter-speaker similarity based on deep neural networks (DNNs). Although conventional DNN-based speaker embedding such as a d-vector can be applied to multi-speaker modeling in speech synthesis, it does not correlate with the subjective inter-speaker similarity and is not necessarily an appropriate speaker representation for open speakers whose speech utterances are not included in the training data. We propose two training algorithms for a DNN-based speaker embedding model using an inter-speaker similarity matrix obtained by large-scale subjective scoring. One is based on similarity vector embedding and trains the model to predict a vector of the similarity matrix as the speaker representation. The other is based on similarity matrix embedding and trains the model to minimize the squared Frobenius norm between the…
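The two training objectives named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the toy similarity values, and the use of mean squared error for the vector objective are all assumptions made for illustration. The abstract is also truncated before it states the second term of the Frobenius-norm objective, so the Gram matrix of embeddings used below is only an assumed stand-in.

```python
import numpy as np


def similarity_vector_loss(pred, sim_matrix, speaker_idx):
    """Mean squared error between predicted vectors and the matching
    rows of the similarity matrix (assumed form of the vector objective)."""
    target = sim_matrix[speaker_idx]  # shape: (batch, n_speakers)
    return float(np.mean((pred - target) ** 2))


def squared_frobenius(a, b):
    """Squared Frobenius norm of the difference between two matrices."""
    return float(np.sum((a - b) ** 2))


def gram_matrix(emb):
    """Inner products between all pairs of embeddings.
    ASSUMPTION: the source text is cut off before naming this term."""
    return emb @ emb.T


# Toy symmetric similarity matrix for three speakers (made-up values).
S = np.array([[1.0, 0.5, -0.2],
              [0.5, 1.0, 0.1],
              [-0.2, 0.1, 1.0]])

# Vector objective: a model that outputs the exact rows incurs zero loss.
speaker_idx = np.array([0, 2])
pred = S[speaker_idx].copy()
loss_vec = similarity_vector_loss(pred, S, speaker_idx)

# Matrix objective: orthonormal embeddings give an identity Gram matrix,
# so only the off-diagonal entries of S contribute to the loss.
emb = np.eye(3)
loss_mat = squared_frobenius(S, gram_matrix(emb))
```

In an actual training setup, `pred` and `emb` would be the outputs of the DNN for a batch of utterances, and these losses would be minimized by gradient descent in a framework with automatic differentiation.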
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
