Optimizing Multilingual Text-To-Speech with Accents & Emotions

Pranav Pawar; Akshansh Dwivedi; Jenish Boricha; Himanshu Gohil; Aditya Dubey

arXiv:2506.16310 · cs.LG · June 23, 2025

TL;DR

This paper presents a multilingual TTS system that models accents and emotions, with a focus on Hindi and Indian English. It extends Parler-TTS with language-specific phoneme alignment, culture-sensitive emotion embeddings, and dynamic accent code switching, and reports gains in naturalness and cultural accuracy.

Contribution

It introduces a new TTS architecture with culture-sensitive emotion embedding and dynamic accent switching, improving multilingual speech synthesis quality and naturalness.

Findings

01  23.7% improvement in accent accuracy

02  85.3% emotion recognition accuracy

03  High user satisfaction with MOS of 4.2/5
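
The abstract describes the accent-accuracy figure as a Word Error Rate (WER) reduction, although the excerpt is cut off before any baseline is given. As a rough illustration of how such a figure is typically computed, the sketch below implements standard word-level WER and a relative reduction. The function names, the example sentences, and the baseline numbers are all hypothetical, and whether the paper reports a relative or an absolute reduction is not stated in this excerpt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (sub + del + ins) divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop of an error rate relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion over 6 words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.3f}")
    # Hypothetical error rates, not values from the paper.
    print(f"relative reduction = {relative_reduction(0.20, 0.15):.1%}")
```

In this setup, a synthesized utterance would be transcribed by a speech recognizer and compared against the input text, so a lower WER suggests more intelligible, accent-appropriate output.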

Abstract

Although state-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual settings, synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions remains difficult owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent modelling and transliteration preservation with multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model with a language-specific phoneme alignment hybrid encoder-decoder architecture, culture-sensitive emotion embedding layers trained on native speaker corpora, and dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word Error Rate reduction…
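
The abstract names residual vector quantization (RVQ) but the excerpt does not say how it is wired into accent code switching. As a generic illustration of the technique only, the sketch below quantizes a vector in stages, where each stage encodes whatever the previous stages left unexplained. The codebooks, the input vector, and the function names are toy assumptions and are not taken from the paper or from Parler-TTS.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = Sequence[Sequence[float]]


def nearest_index(codebook: Codebook, v: Sequence[float]) -> int:
    """Index of the codebook entry with the smallest squared distance to v."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], v)),
    )


def rvq_encode(x: Sequence[float], codebooks: Sequence[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        k = nearest_index(codebook, residual)
        codes.append(k)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


if __name__ == "__main__":
    # Toy two-stage codebooks (hypothetical): coarse stage, then a finer stage.
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.25, -0.25]],
    ]
    codes, residual = rvq_encode([1.2, 0.8], codebooks)
    print("codes:", codes)
    print("reconstruction:", rvq_decode(codes, codebooks))
    print("remaining residual:", residual)
```

Adding stages shrinks the remaining residual at the cost of one extra code index per stage, which is the trade-off that makes RVQ common in neural audio codecs.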

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Phonetics and Phonology Research