HiFi-VC: High Quality ASR-Based Voice Conversion

A. Kashkin; I. Karpukhin; S. Shishkin

arXiv:2203.16937 · cs.SD · April 1, 2022 · 1 citation

TL;DR

This paper introduces HiFi-VC, an any-to-any voice conversion system that combines ASR features, pitch tracking, and a state-of-the-art waveform prediction model to convert speech to the voice of speakers unseen during training.

Contribution

The paper presents a new any-to-any voice conversion pipeline built on ASR features, pitch tracking, and a waveform prediction model, and reports better voice quality, similarity and consistency than modern baselines.

Findings

01 Outperforms modern baselines in voice quality

02 Achieves higher similarity and consistency

03 Validated through subjective and objective evaluations

Abstract

The goal of voice conversion (VC) is to convert input voice to match the target speaker's voice while keeping text and prosody intact. VC is usually used in entertainment and speaking-aid systems, as well as applied for speech data generation and augmentation. The development of any-to-any VC systems, which are capable of generating voices unseen during model training, is of particular interest to both researchers and the industry. Despite recent progress, any-to-any conversion quality is still inferior to natural speech. In this work, we propose a new any-to-any voice conversion pipeline. Our approach uses automated speech recognition (ASR) features, pitch tracking, and a state-of-the-art waveform prediction model. According to multiple subjective and objective evaluations, our method outperforms modern baselines in terms of voice quality, similarity and consistency.
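The data flow named in the abstract (content features from ASR, a pitch track, and a waveform model) can be illustrated with a toy sketch. This is not the authors' implementation: every function name below is hypothetical, the "ASR features" are replaced by low-frequency spectrum bins, the pitch tracker is a plain autocorrelation peak picker, and the target-speaker embedding is an assumption added for illustration, since the abstract does not say how the target voice is represented. The sketch only builds the per-frame conditioning that a waveform prediction model might consume.

```python
import numpy as np

FRAME = 512  # hypothetical frame size in samples (no overlap, for simplicity)


def frame_signal(wave):
    """Split a 1-D waveform into non-overlapping frames, dropping the tail."""
    n_frames = len(wave) // FRAME
    return wave[: n_frames * FRAME].reshape(n_frames, FRAME)


def content_features(wave, dim=16):
    """Stand-in for ASR features: first `dim` magnitude-spectrum bins per frame."""
    return np.abs(np.fft.rfft(frame_signal(wave), axis=1))[:, :dim]


def pitch_track(wave, sr, fmin=80, fmax=400):
    """Stand-in pitch tracker: per-frame F0 (Hz) from the autocorrelation peak.

    There is no voiced/unvoiced decision here; a real tracker would need one.
    """
    lo, hi = sr // fmax, sr // fmin  # lag search range in samples
    assert hi < FRAME, "frame too short for the lowest pitch"
    f0 = []
    for frame in frame_signal(wave):
        # Keep non-negative lags only: index 0 corresponds to lag 0.
        ac = np.correlate(frame, frame, mode="full")[FRAME - 1:]
        lag = lo + int(np.argmax(ac[lo:hi + 1]))
        f0.append(sr / lag)
    return np.array(f0)


def speaker_embedding(wave, dim=16):
    """Assumed stand-in for a target-speaker vector: time-averaged features."""
    return content_features(wave, dim).mean(axis=0)


def build_conditioning(source, target, sr):
    """Stack content, pitch and speaker information, one row per source frame."""
    content = content_features(source)            # (n_frames, 16)
    f0 = pitch_track(source, sr)[:, None]         # (n_frames, 1)
    spk = np.tile(speaker_embedding(target), (len(content), 1))  # (n_frames, 16)
    return np.concatenate([content, f0, spk], axis=1)
```

In this sketch, text and prosody come from the source utterance (content features and pitch), while the target utterance contributes only the speaker vector; a waveform prediction model would then map each conditioning row to audio samples.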

Code & Models

Repositories

anton-kashkin/hifi_vc (official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing