StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion

Zhichao Wang; Yuanzhe Chen; Xinsheng Wang; Lei Xie; Yuping Wang

arXiv:2408.02178 · eess.AS · August 6, 2024

Open Access

TL;DR

StreamVoice+ presents an end-to-end streaming voice conversion framework that improves naturalness and speaker similarity by eliminating reliance on streaming ASR and introducing novel training and refinement strategies.

Contribution

The paper introduces StreamVoice+, an end-to-end streaming voice conversion model that operates independently of streaming ASR and employs a two-stage training process with residual compensation and self-refinement.

Findings

01. Higher naturalness and speaker similarity than StreamVoice

02. Supports both streaming and non-streaming voice conversion scenarios

03. Achieves improved conversion stability and quality

Abstract

StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the streaming domain. It uses a streamable language model (LM) with a context-aware approach to convert semantic features from automatic speech recognition (ASR) into acoustic features with the desired speaker timbre. Despite its innovations, StreamVoice is limited by its dependency on a streaming ASR within a cascaded framework: this dependency complicates system deployment and optimization, ties the VC system's design and performance to the choice of ASR, and makes conversion unstable when the semantic inputs are of low quality. To overcome these limitations, we introduce StreamVoice+, an enhanced LM-based end-to-end streaming framework that operates independently of streaming ASR. StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems