SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech

Zhuangfei Cheng; Guangyan Zhang; Zehai Tu; Yangyang Song; Shuiyang Mao; Xiaoqi Jiao; Jingyu Li; Yiwen Guo; Jiasong Wu

arXiv:2507.01348 · eess.AS · July 9, 2025

Open Access

TL;DR

SpeechAccentLLM is a unified framework that applies LLM techniques to foreign accent conversion (FAC) and text-to-speech (TTS). It combines a new speech tokenization model, multitask training for data efficiency, and a postprocessing stage for quality enhancement.

Contribution

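The paper presents three components: SpeechCodeVAE, a speech content tokenizer that integrates connectionist temporal classification (CTC) directly into codebook discretization; a multitask training strategy that jointly trains the FAC and TTS modules; and SpeechRestorer, a postprocessing module that reduces errors and improves prosody.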

Findings

01 SpeechCodeVAE tokens achieve optimal trade-offs among content faithfulness, temporal coherence, and structural recoverability.

02 Multitask training accelerates convergence and improves speech quality.

03 SpeechRestorer effectively reduces errors and enhances prosody.

Abstract

Foreign accent conversion (FAC) in speech processing remains a challenging task. Building on the remarkable success of large language models (LLMs) in Text-to-Speech (TTS) tasks, this study investigates the adaptation of LLM-based techniques for FAC, which we term SpeechAccentLLM. At the core of this framework, we introduce SpeechCodeVAE, the first model to integrate connectionist temporal classification (CTC) directly into codebook discretization for speech content tokenization. This novel architecture generates tokens with a unique "locality" property, as validated by experiments demonstrating optimal trade-offs among content faithfulness, temporal coherence, and structural recoverability. Then, to address data scarcity for the FAC module, we adopted a multitask learning strategy that jointly trains the FAC and TTS modules. Beyond mitigating data limitations, this approach yielded…
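To make the terms "codebook discretization" and "CTC" in the abstract concrete, here is a minimal sketch of the two generic ingredients: nearest-neighbor lookup in a codebook, and the standard CTC collapsing rule (merge consecutive repeats, then drop blanks). This is not the SpeechCodeVAE architecture, which according to the abstract integrates CTC directly into the discretization step. The toy codebook, the choice of index 0 as the blank, and all function names are assumptions made for illustration.

```python
# Illustrative only: generic vector quantization plus CTC-style collapsing.
# The codebook values, the blank index, and every function name here are
# hypothetical and are not taken from the paper.

BLANK = 0  # assumption: codebook index 0 plays the role of the CTC blank


def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (squared L2)."""
    def sq_dist(code):
        return sum((f - c) ** 2 for f, c in zip(frame, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize(frames, codebook):
    """Map each frame to its nearest codebook index (one token per frame)."""
    return [nearest_code(frame, codebook) for frame in frames]


def ctc_collapse(tokens, blank=BLANK):
    """Standard CTC collapsing: merge consecutive repeats, then drop blanks."""
    collapsed = []
    previous = None
    for token in tokens:
        if token != previous and token != blank:
            collapsed.append(token)
        previous = token
    return collapsed


# Toy 2-D "frames" and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.9, 0.1], [1.1, -0.1], [0.1, 0.0], [0.0, 0.9], [0.1, 1.2], [0.9, 0.0]]

frame_tokens = quantize(frames, codebook)
content_tokens = ctc_collapse(frame_tokens)
print(frame_tokens)    # [1, 1, 0, 2, 2, 1]
print(content_tokens)  # [1, 2, 1]
```

The frame-level sequence keeps one token per input frame, while the collapsed sequence keeps only the order of distinct content units. How the paper combines the two steps, and what its "locality" property means in detail, is not recoverable from the truncated abstract.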

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Natural Language Processing Techniques