Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS

Tuan Nam Nguyen; Seymanur Akti; Ngoc Quan Pham; Alexander Waibel

arXiv:2410.14997·cs.SD·March 5, 2025

Open Access

TL;DR

This paper introduces an accent conversion method that improves both pronunciation and accent naturalness. It uses knowledge distillation and a native TTS system to generate synthetic native-like ground-truth speech, with the aim of improving comprehensibility.

Contribution

It presents a new approach combining knowledge distillation and synthetic ground-truth generation within the VITS framework for improved accent conversion and pronunciation enhancement.

Findings

01. Produced native-like accent conversion with preserved speaker identity.

02. Improved pronunciation clarity demonstrated in evaluations.

03. Achieved high-quality waveform synthesis for non-native speakers.

Abstract

Previous approaches to accent conversion (AC) mainly aimed to make non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only converts the accent but also improves the pronunciation of non-native accented speakers. Given the non-native audio and the corresponding transcript, we generate ideal ground-truth audio with native-like pronunciation that keeps the original duration and prosody. This ground-truth data aids the model in learning a direct mapping between accented and native speech. We utilize the end-to-end VITS framework to achieve high-quality waveform reconstruction for the AC task. As a result, our system not only produces audio that closely resembles…
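The data-preparation idea in the abstract (pair each non-native recording with a synthetic native-like target of the same duration) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Utterance`, `TrainingPair`, `fit_to_length`, `build_pairs`, and the `native_tts` callable are hypothetical names, frames are plain lists of floats, and the uniform nearest-neighbour resampling is only a crude stand-in for whatever duration and prosody matching the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Utterance:
    """Hypothetical container: a non-native recording and its transcript."""
    frames: List[float]  # stand-in for acoustic frames
    transcript: str


@dataclass
class TrainingPair:
    """Hypothetical parallel example for learning an accented-to-native mapping."""
    source: List[float]  # accented input frames
    target: List[float]  # synthetic native-like frames, same length as source


def fit_to_length(frames: Sequence[float], length: int) -> List[float]:
    """Uniformly resample frames (nearest neighbour) to exactly `length` items.

    Assumption: this simple resampling replaces real duration/prosody alignment.
    """
    if length <= 0 or not frames:
        return []
    n = len(frames)
    return [frames[min(n - 1, i * n // length)] for i in range(length)]


def build_pairs(
    utterances: Sequence[Utterance],
    native_tts: Callable[[str], List[float]],
) -> List[TrainingPair]:
    """Create (accented, synthetic native) pairs with matching durations."""
    pairs = []
    for utt in utterances:
        synthetic = native_tts(utt.transcript)  # hypothetical native TTS call
        target = fit_to_length(synthetic, len(utt.frames))
        pairs.append(TrainingPair(source=list(utt.frames), target=target))
    return pairs
```

In use, `native_tts` would be replaced by a real native-accent TTS system; the only property the sketch demonstrates is that each synthetic target ends up with the same number of frames as its accented source.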

Taxonomy

Topics: Speech and dialogue systems