Optimizing Multilingual Text-To-Speech with Accents & Emotions
Pranav Pawar, Akshansh Dwivedi, Jenish Boricha, Himanshu Gohil, Aditya Dubey

TL;DR
This paper presents a novel multilingual TTS system that effectively models accents and emotions, especially for Hindi and Indian English, achieving high naturalness and cultural accuracy through innovative architecture and training methods.
Contribution
It introduces a new TTS architecture with culture-sensitive emotion embedding and dynamic accent switching, improving multilingual speech synthesis quality and naturalness.
Findings
23.7% improvement in accent accuracy
85.3% emotion recognition accuracy
High user satisfaction with MOS of 4.2/5
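The abstract describes the accent-accuracy gain as a Word Error Rate (WER) reduction. As a reminder of how that metric is conventionally computed, here is a minimal sketch: WER is the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name and example sentences are illustrative assumptions, not taken from the paper, and the paper's own evaluation pipeline may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); not the paper's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A lower WER on transcripts of synthesized speech is commonly read as a sign of more intelligible output; how the paper maps this onto accent accuracy is not visible in the truncated abstract.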
Abstract
Although state-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual settings, synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions remains difficult owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent modelling and transliteration preservation with multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model by integrating a language-specific phoneme alignment hybrid encoder-decoder architecture and culture-sensitive emotion embedding layers trained on native speaker corpora, as well as incorporating dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word Error Rate reduction…
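The abstract mentions residual vector quantization (RVQ) without describing it. As background, the following is a minimal sketch of generic RVQ: each stage picks the nearest entry in its own codebook, and the next stage quantizes what is left over (the residual). The function names and the tiny hand-written codebooks are assumptions made for illustration. Real systems learn their codebooks from data, and this sketch does not reproduce the paper's accent code-switching mechanism.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage only sees the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy, hand-written codebooks (assumed values): a coarse stage and a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes = rvq_encode([1.2, 0.9], codebooks)
approx = rvq_decode(codes, codebooks)
```

Each added stage refines the approximation, which is why RVQ-based codecs can trade bitrate against fidelity by varying the number of stages.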
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Phonetics and Phonology Research
