SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

Chun Yin; Tai-Shih Chi; Yu Tsao; Hsin-Min Wang

arXiv:2406.08445·eess.AS·June 13, 2024

TL;DR

This paper introduces SVSNet+, a model that uses representations from pre-trained speech foundation models, such as WavLM, to significantly improve speaker voice similarity assessment on the Voice Conversion Challenge 2018 and 2020 datasets.

Contribution

SVSNet+ is the first to systematically incorporate pre-trained speech foundation model representations into speaker similarity assessment, demonstrating improved performance and generalization.

Findings

01 SVSNet+ with WavLM outperforms baseline models on the Voice Conversion Challenge 2018 and 2020 datasets.

02 Learning a weighted sum of WavLM features substantially improves performance, whereas fine-tuning WavLM on the small downstream dataset does not.

03 SVSNet+ maintains strong performance with different speech foundation models.

Abstract

Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations achieves significant improvements over baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other…
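To make the weighted-sum idea concrete, here is a minimal sketch of one common way to combine per-layer features from a frozen foundation model: a learnable scalar per layer, normalized with a softmax, then used to average the layer outputs. The softmax normalization, the function names (`softmax`, `weighted_sum`), and the toy dimensions are assumptions for illustration; the listing above does not specify how SVSNet+ parameterizes its weights.

```python
import math


def softmax(logits):
    """Normalize raw per-layer scores into weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(layer_feats, layer_logits):
    """Fuse per-layer features into one representation.

    layer_feats:  nested list of shape [num_layers][num_frames][dim]
                  (hypothetical stand-in for the hidden states of a frozen SFM)
    layer_logits: list of num_layers scalars; in training these would be the
                  only parameters learned for the fusion step
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for w, layer in zip(weights, layer_feats):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += w * layer[t][d]
    return fused


# Toy input: 3 layers, 2 frames, 2 feature dimensions (made-up values).
layer_feats = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 4.0], [6.0, 8.0]],
    [[3.0, 6.0], [9.0, 12.0]],
]
layer_logits = [0.0, 0.0, 0.0]  # equal logits give uniform layer weights

fused = weighted_sum(layer_feats, layer_logits)
```

With equal logits the fused output is simply the mean over layers; training would shift the logits so that the layers most useful for similarity assessment dominate, while the foundation model itself stays frozen. This keeps the number of trainable fusion parameters equal to the number of layers, which is one plausible reason such an approach can work with a small downstream dataset where full fine-tuning does not.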

Taxonomy

Topics: Speech Recognition and Synthesis