HiFi-VC: High Quality ASR-Based Voice Conversion
A. Kashkin, I. Karpukhin, S. Shishkin

TL;DR
This paper introduces HiFi-VC, a novel voice conversion system that leverages ASR features, pitch tracking, and advanced waveform prediction to achieve high-quality, any-to-any voice conversion capable of generating natural-sounding speech.
Contribution
The paper presents a new any-to-any voice conversion pipeline, built on ASR features, pitch tracking, and a waveform prediction model, that improves voice quality and speaker similarity over modern baselines.
Findings
Outperforms modern baselines in voice quality
Achieves higher similarity and consistency
Validated through subjective and objective evaluations
Abstract
The goal of voice conversion (VC) is to convert input voice to match the target speaker's voice while keeping text and prosody intact. VC is usually used in entertainment and speaking-aid systems, as well as applied for speech data generation and augmentation. The development of any-to-any VC systems, which are capable of generating voices unseen during model training, is of particular interest to both researchers and the industry. Despite recent progress, any-to-any conversion quality is still inferior to natural speech. In this work, we propose a new any-to-any voice conversion pipeline. Our approach uses automated speech recognition (ASR) features, pitch tracking, and a state-of-the-art waveform prediction model. According to multiple subjective and objective evaluations, our method outperforms modern baselines in terms of voice quality, similarity and consistency.
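The abstract names pitch tracking as one input to the pipeline but does not say which tracker is used. As a rough illustration of what a pitch tracker computes, the sketch below estimates the fundamental frequency (F0) of a single frame by autocorrelation. The function name `estimate_f0`, the 60 to 400 Hz search range, and the choice of autocorrelation are all assumptions made for this example and are not taken from the paper.

```python
import math

def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return an F0 estimate in Hz for one frame, or 0.0 if the frame is silent.

    Hypothetical helper: picks the lag with the largest autocorrelation
    inside the allowed pitch range. Not the tracker used in the paper.
    """
    if not any(frame):
        return 0.0
    # Convert the frequency search range into a lag (period) range in samples.
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation of the frame with itself shifted by `lag`.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

# Usage: a synthetic 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(1024)]
f0 = estimate_f0(tone, sample_rate)
print(f"estimated F0: {f0:.1f} Hz")
```

In a conversion pipeline of the kind the abstract describes, a per-frame F0 track like this would be passed alongside the ASR features so that the converted speech keeps the source prosody. A production tracker would also need voicing detection and smoothing across frames, which this sketch omits.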
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
