Non-autoregressive real-time Accent Conversion model with voice cloning

Vladimir Nechaev; Sergey Kosyakov

arXiv:2405.13162 · cs.SD · May 24, 2024 · 1 cites

Open Access

TL;DR

This paper introduces a non-autoregressive, real-time accent conversion model with voice cloning. It improves speech quality and reduces latency, which makes it suitable for real-time multi-user communication.

Contribution

The paper presents a novel non-autoregressive model capable of real-time accent conversion and voice cloning, addressing latency and flexibility limitations of prior neural network-based systems.

Findings

01 Improves speech quality and recognition performance.

02 Enables real-time voice cloning and accent modification.

03 Demonstrates high-quality, low-latency speech conversion.

Abstract

Foreign Accent Conversion (FAC) models are currently built on deep neural network architectures and on ensembles of neural networks for speech recognition and speech generation. Their use is limited by architectural features that do not allow flexible changes to the timbre of the generated speech and that require context to be accumulated, which increases generation delay and makes these systems unsuitable for real-time multi-user communication. We have developed a non-autoregressive model for real-time accent conversion with voice cloning. The model generates native-sounding L1 speech with minimal latency from L2-accented input speech. It consists of interconnected modules for extracting accent, gender, and speaker embeddings, converting speech, generating spectrograms, and decoding the resulting spectrogram…
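The module list in the abstract can be pictured with a small sketch. This is only an illustration under loud assumptions: every function name, the `Embeddings` type, the chunk size, and the arithmetic standing in for trained networks are hypothetical and do not come from the paper. The one property it tries to show is the non-autoregressive one: each output frame is computed from the current input chunk and the embeddings, never from previously generated output, so chunks can be emitted as soon as they arrive.

```python
from dataclasses import dataclass
from typing import Iterator, List

Frame = List[float]  # one spectrogram frame (hypothetical representation)


@dataclass
class Embeddings:
    accent: List[float]
    gender: List[float]
    speaker: List[float]


def extract_embeddings(reference: List[Frame]) -> Embeddings:
    """Placeholder encoders: average the reference frames.

    A real system would use trained networks, one per embedding type.
    """
    dim = len(reference[0])
    mean = [sum(f[i] for f in reference) / len(reference) for i in range(dim)]
    return Embeddings(accent=mean, gender=mean, speaker=mean)


def convert_chunk(chunk: List[Frame], emb: Embeddings) -> List[Frame]:
    """Non-autoregressive step (toy version).

    Each output frame depends only on the matching input frame and the
    speaker embedding, not on any earlier output frame.
    """
    return [[x + s for x, s in zip(frame, emb.speaker)] for frame in chunk]


def decode(spectrogram: List[Frame]) -> List[float]:
    """Placeholder decoder: emits one sample per frame (the frame mean)."""
    return [sum(frame) / len(frame) for frame in spectrogram]


def stream_convert(
    frames: List[Frame], emb: Embeddings, chunk_size: int = 4
) -> Iterator[List[float]]:
    """Yield decoded audio chunk by chunk, without waiting for the full input."""
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        yield decode(convert_chunk(chunk, emb))


if __name__ == "__main__":
    frames = [[float(i), float(i)] for i in range(10)]
    emb = extract_embeddings(frames[:2])
    for samples in stream_convert(frames, emb):
        print(samples)
```

Because no step reads earlier output, converting the input chunk by chunk gives the same result as converting it in one pass, which is what makes low-latency streaming possible in a design of this kind.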

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing