Non-autoregressive real-time Accent Conversion model with voice cloning
Vladimir Nechaev, Sergey Kosyakov

TL;DR
This paper introduces a non-autoregressive, real-time accent conversion model with voice cloning that improves speech quality and reduces latency, making it suitable for multi-user communication.
Contribution
The paper presents a novel non-autoregressive model capable of real-time accent conversion and voice cloning, addressing latency and flexibility limitations of prior neural network-based systems.
Findings
Improves speech quality and recognition performance.
Enables real-time voice cloning and accent modification.
Demonstrates high-quality, low-latency speech conversion.
Abstract
Currently, Foreign Accent Conversion (FAC) models are built on deep neural network architectures and on ensembles of neural networks for speech recognition and speech generation. Their use is limited by architectural features that do not allow flexible changes to the timbre of the generated speech and that require context to be accumulated, which increases generation delay and makes these systems unsuitable for real-time multi-user communication. We have developed a non-autoregressive model for real-time accent conversion with voice cloning. The model generates native-sounding L1 speech with minimal latency from L2-accented input speech. The model consists of interconnected modules for extracting accent, gender, and speaker embeddings, converting speech, generating spectrograms, and decoding the resulting spectrogram…
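The module chain named in the abstract can be pictured with a rough data-flow sketch. Everything below is an illustrative assumption and not the authors' implementation: the function names, the use of a reference clip for the embeddings, and the arithmetic stand-ins for what would be neural networks are all hypothetical. The only point it shows is that each incoming chunk is converted using fixed embeddings and never depends on previously generated output, which is what allows low-latency streaming.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

Chunk = List[float]  # one short block of audio samples


@dataclass
class Embeddings:
    """Hypothetical container for the three embeddings named in the abstract."""
    accent: List[float]
    gender: List[float]
    speaker: List[float]


def extract_embeddings(reference: Chunk) -> Embeddings:
    # Stand-in for the embedding extractors (assumed to run once, up front).
    mean = sum(reference) / len(reference) if reference else 0.0
    return Embeddings(accent=[mean], gender=[mean], speaker=[mean])


def convert_chunk(chunk: Chunk, emb: Embeddings) -> Chunk:
    # Stand-in for the speech conversion module: conditioned on the
    # embeddings only, not on any earlier output (non-autoregressive).
    bias = emb.accent[0] + emb.gender[0] + emb.speaker[0]
    return [x + bias for x in chunk]


def generate_spectrogram(converted: Chunk) -> Chunk:
    # Stand-in for the spectrogram generator.
    return [abs(x) for x in converted]


def decode(spectrogram: Chunk) -> Chunk:
    # Stand-in for the decoder that turns a spectrogram back into audio.
    return list(spectrogram)


def stream_convert(chunks: Iterable[Chunk], emb: Embeddings) -> Iterator[Chunk]:
    # Each chunk is emitted as soon as it is processed.
    for chunk in chunks:
        yield decode(generate_spectrogram(convert_chunk(chunk, emb)))
```

In this sketch, swapping the `Embeddings` passed to `stream_convert` would change the target voice without touching the rest of the chain, which is one plausible reading of the "flexible changes in timbre" the abstract mentions.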
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
