Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training

Jan Melechovsky; Ambuj Mehrish; Berrak Sisman; Dorien Herremans

arXiv:2406.01018 · eess.AS · October 1, 2024

TL;DR

This paper introduces a novel TTS model using multi-level VAE and adversarial training to improve accent conversion, aiming to create more inclusive speech synthesis systems that accurately represent diverse accents.

Contribution

The paper presents a new TTS approach combining multi-level VAE and adversarial learning for enhanced accent conversion, addressing inclusivity in speech technology.

Findings

01 Improved accent conversion performance over baseline models

02 Enhanced subjective listening test scores for accent accuracy

03 Objective metrics indicate better accent representation

Abstract

With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that must be taken into consideration when building inclusive speech synthesizers. Inclusive speech technology aims to erase any biases towards specific groups, such as people with a particular accent. We note that state-of-the-art Text-to-Speech (TTS) systems may currently not be suitable for all people, regardless of their background, as they are designed to generate high-quality voices without focusing on accent. In this paper, we propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion in TTS, with a vision for more inclusive systems in the future. We evaluate the performance through both objective metrics and subjective listening tests. The…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques