UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech
Shuhei Kato

TL;DR
UtterTune is a lightweight, low-rank adaptation method that fine-tunes multilingual TTS systems for precise pronunciation control in a target language, demonstrated on Japanese, while preserving overall speech quality.
Contribution
The paper introduces UtterTune, a low-rank adaptation (LoRA) approach that adds phoneme-level pronunciation and pitch-accent control to an LLM-based multilingual TTS model without retraining the full model.
Findings
Enables phoneme-level control of segmental pronunciation and pitch accent in Japanese TTS.
Maintains naturalness and speaker similarity in zero-shot settings.
Objective and subjective evaluations confirm improvements.
Abstract
We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its…
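The low-rank adaptation idea the abstract relies on can be illustrated with a minimal sketch. This is generic LoRA on a single linear layer, not UtterTune's implementation: the summary does not state which layers are adapted, the rank, or the training setup, so every name and value here (`lora_linear`, `rank`, `alpha`, the matrix sizes, and the stand-in "trained" values) is an assumption chosen for illustration. The sketch shows two properties: with the adapter's `B` matrix initialized to zero the output matches the frozen base layer exactly, and the adapter holds far fewer parameters than the layer it modifies.

```python
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_linear(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x), where r is the adapter rank.

    W (d_out x d_in) is the frozen pretrained weight.
    A (r x d_in) and B (d_out x r) are the only trainable matrices.
    """
    r = len(A)
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, rank, alpha = 8, 8, 2, 4.0  # illustrative sizes, not from the paper

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(W, x)
y_init = lora_linear(x, W, A, B, alpha)   # identical to y_base while B is zero

# Stand-in for values that training would produce (hypothetical).
B_trained = [[0.1] * rank for _ in range(d_out)]
y_adapted = lora_linear(x, W, A, B_trained, alpha)

full_params = d_in * d_out                # parameters in the frozen layer
lora_params = rank * (d_in + d_out)       # parameters in the adapter

print("adapter is a no-op at init:", y_init == y_base)
print("adapter changes output after update:", y_adapted != y_base)
print("trainable parameters:", lora_params, "of", full_params)
```

Because the base weight `W` is never modified, removing or disabling the adapter recovers the original model's behavior, which is the general mechanism by which LoRA-style methods can target one language while leaving others untouched.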
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Natural Language Processing Techniques
