UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech

Shuhei Kato

arXiv:2508.09767 · cs.SD · September 24, 2025

TL;DR

UtterTune is a lightweight, low-rank adaptation method that fine-tunes multilingual TTS systems for precise pronunciation control in a target language, demonstrated on Japanese, while preserving overall speech quality.

Contribution

The paper introduces UtterTune, a low-rank adaptation (LoRA) approach that enables phoneme-level control of segmental pronunciation and pitch accent in a multilingual TTS model without extensive retraining.
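
As a rough illustration of the general low-rank adaptation idea named above, the following minimal sketch adds a trainable low-rank update B·A, scaled by alpha / rank, to a frozen weight matrix W. It is not the paper's implementation: the helper names (`matmul`, `lora_weight`), the matrix sizes, the rank, and the scaling factor are all assumptions chosen for illustration, and this page does not state which layers UtterTune adapts.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the generic LoRA-adapted weight."""
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Illustrative sizes only; none of these values come from the paper.
d_out, d_in, rank, alpha = 8, 8, 2, 4.0
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, small random init
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero init

# With B initialised to zero, the adapter starts as a no-op on the base model.
adapted = lora_weight(w, a, b, alpha, rank)

full_params = d_out * d_in           # 64 parameters in the full matrix
lora_params = rank * (d_in + d_out)  # 32 trainable parameters in the adapter
```

Only `a` and `b` would be updated during fine-tuning in this sketch, which is why the trainable parameter count depends on the rank rather than on the full matrix size.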

Findings

1. Effective pronunciation control in Japanese TTS.

2. Maintains naturalness and speaker similarity in zero-shot settings.

3. Objective and subjective evaluations confirm improvements.

Abstract

We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its…

Code & Models

Models

shuheikatoinfo/UtterTune-CosyVoice2-ja-JSUTJVS (Hugging Face model · 3 downloads · 4 likes)

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Natural Language Processing Techniques