TL;DR
SVSNet is an end-to-end neural network that assesses speaker voice similarity between converted and natural speech directly from raw waveforms, and it outperforms well-known baseline systems on voice conversion evaluation.
Contribution
The paper introduces SVSNet, the first end-to-end model for speaker similarity assessment. It takes raw waveforms as input instead of the hand-crafted features used by most neural evaluation metrics.
Findings
SVSNet outperforms well-known baseline systems on the VCC2018 and VCC2020 datasets.
The improvement holds at both the utterance level and the system level.
Its predictions agree with human similarity judgments more closely than those of the baselines.
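As a rough illustration of what utterance-level and system-level assessment mean, the sketch below correlates predicted scores with human ratings twice: once per utterance, and once after averaging the scores of each voice conversion system. The toy records, the function names, and the choice of Pearson correlation are assumptions made for this example; the summary above does not say which correlation measures the paper reports.

```python
from collections import defaultdict
from statistics import mean


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def system_level(records):
    """Average predicted and human scores per system.

    records: iterable of (system_id, predicted_score, human_score).
    Returns two lists aligned by sorted system id.
    """
    pred, human = defaultdict(list), defaultdict(list)
    for system_id, p, h in records:
        pred[system_id].append(p)
        human[system_id].append(h)
    systems = sorted(pred)
    return [mean(pred[s]) for s in systems], [mean(human[s]) for s in systems]


# Hypothetical toy data: (system, model prediction, human similarity rating).
records = [
    ("sysA", 3.1, 3.0), ("sysA", 2.7, 3.5),
    ("sysB", 1.9, 2.0), ("sysB", 2.4, 1.5),
    ("sysC", 3.8, 4.0), ("sysC", 3.5, 3.5),
]

# Utterance level: one (prediction, rating) pair per utterance.
utt_lcc = pearson([p for _, p, _ in records], [h for _, _, h in records])

# System level: one (mean prediction, mean rating) pair per system.
sys_pred, sys_human = system_level(records)
sys_lcc = pearson(sys_pred, sys_human)

print(f"utterance-level correlation: {utt_lcc:.3f}")
print(f"system-level correlation:    {sys_lcc:.3f}")
```

Averaging within a system smooths out per-utterance noise, which is why system-level correlations are typically higher than utterance-level ones.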
Abstract
Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between converted speech and natural speech for voice conversion tasks. Unlike most neural evaluation metrics that use hand-crafted features, SVSNet directly takes the raw waveform as input to more completely utilize speech information for prediction. SVSNet consists of encoder, co-attention, distance calculation, and prediction modules and is trained in an end-to-end manner. The experimental results on the Voice Conversion Challenge 2018 and 2020 (VCC2018 and VCC2020) datasets show that SVSNet outperforms well-known baseline systems in the assessment of speaker similarity at the utterance and system levels.
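To make the four-stage structure named in the abstract (encoder, co-attention, distance calculation, prediction) easier to picture, here is a deliberately tiny, untrained sketch of how data might flow through such a pipeline. Everything in it is an assumption for illustration: the frame-statistics "encoder", the dot-product attention, the mean absolute distance, the exponential mapping to a score, and all function names. It is not the authors' architecture, which learns each module end to end.

```python
import math


def encode(waveform, frame_len=4):
    """Toy stand-in for a learned encoder: [energy, mean] per frame."""
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, frame_len)]
    return [[sum(x * x for x in f) / frame_len, sum(f) / frame_len]
            for f in frames]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attend(query_seq, key_seq):
    """For each query frame, return an attention-weighted sum of key frames."""
    dim = len(key_seq[0])
    aligned = []
    for q in query_seq:
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in key_seq])
        aligned.append([sum(w * k[d] for w, k in zip(weights, key_seq))
                        for d in range(dim)])
    return aligned


def distance(seq_a, seq_b):
    """Mean absolute difference between two equal-length frame sequences."""
    diffs = [abs(a - b) for fa, fb in zip(seq_a, seq_b) for a, b in zip(fa, fb)]
    return sum(diffs) / len(diffs)


def predict_similarity(wave_test, wave_ref):
    """Map the averaged two-direction distance to a score in (0, 1]."""
    enc_test, enc_ref = encode(wave_test), encode(wave_ref)
    # Align each utterance to the other, then compare frame by frame.
    d_test = distance(enc_test, co_attend(enc_test, enc_ref))
    d_ref = distance(enc_ref, co_attend(enc_ref, enc_test))
    return math.exp(-(d_test + d_ref) / 2)


# Hypothetical toy "waveforms": two sinusoids of different frequency.
wave_a = [math.sin(0.3 * n) for n in range(32)]
wave_b = [math.sin(0.5 * n) for n in range(32)]

score = predict_similarity(wave_a, wave_b)
print(f"toy similarity score: {score:.3f}")
```

Averaging the two alignment directions is a choice made in this sketch so that the score does not depend on which utterance is passed first.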
