Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis

Xuehao Zhou; Mingyang Zhang; Yi Zhou; Zhizheng Wu; Haizhou Li

arXiv:2406.10844 · eess.AS · January 3, 2025

Open Access

TL;DR

This paper introduces a multi-scale accent modeling approach for multi-speaker, multi-accent TTS that disentangles speaker and accent features, enabling high-quality, accent-specific speech synthesis.

Contribution

It proposes a novel multi-scale accent modeling framework with global and local accent representations and a speaker disentanglement strategy for independent accent control.

Findings

01

The proposed approach outperforms baseline systems in speech quality and accent rendering.

02

Multi-scale accent modeling captures both global (utterance-level) and local (phoneme-level) accent variations.

03

Speaker disentanglement enables independent control of speaker identity and accent.

Abstract

Generating speech across different accents while preserving speaker identity is crucial for various real-world applications. However, accurately and independently modeling both speaker and accent characteristics in text-to-speech (TTS) systems is challenging due to the complex variations of accents and the inherent entanglement between speaker and accent identities. In this paper, we propose a novel approach for multi-speaker multi-accent TTS synthesis that aims to synthesize speech for multiple speakers, each with various accents. Our approach employs a multi-scale accent modeling strategy to address accent variations on different levels. Specifically, we introduce both global (utterance level) and local (phoneme level) accent modeling to capture overall accent characteristics within an utterance and fine-grained accent variations across phonemes, respectively. To enable independent…
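The two accent scales named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: mean pooling, additive conditioning, known per-phoneme frame durations, and every function name here are assumptions made for illustration, since this excerpt does not describe the actual encoders.

```python
def mean_pool(frames):
    """Average a non-empty list of equal-length vectors into one vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def global_accent_embedding(reference_frames):
    """Utterance level: one vector summarizing the whole reference utterance."""
    return mean_pool(reference_frames)


def local_accent_embeddings(reference_frames, phoneme_durations):
    """Phoneme level: one vector per phoneme, pooled over that phoneme's frames.

    Assumes durations (in frames) are known, e.g. from an external aligner.
    """
    if any(d <= 0 for d in phoneme_durations):
        raise ValueError("every phoneme must span at least one frame")
    if sum(phoneme_durations) != len(reference_frames):
        raise ValueError("durations must cover all frames exactly")
    embeddings, start = [], 0
    for dur in phoneme_durations:
        embeddings.append(mean_pool(reference_frames[start:start + dur]))
        start += dur
    return embeddings


def condition_phonemes(phoneme_hidden, global_emb, local_embs):
    """Add the shared global vector and each phoneme's local vector
    to that phoneme's hidden state (additive conditioning is an assumption)."""
    return [
        [h + g + l for h, g, l in zip(hidden, global_emb, local)]
        for hidden, local in zip(phoneme_hidden, local_embs)
    ]


# Toy example: 3 reference frames of dimension 2, split across 2 phonemes.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
durations = [1, 2]
hidden = [[0.0, 0.0], [1.0, 1.0]]

global_emb = global_accent_embedding(frames)
local_embs = local_accent_embeddings(frames, durations)
conditioned = condition_phonemes(hidden, global_emb, local_embs)
```

The global vector is identical for every phoneme, while the local vectors differ per phoneme; that is the distinction between utterance-level and phoneme-level accent information that the abstract draws.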

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis