Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models

Kyowoon Lee; Artyom Stitsyuk; Gunu Jho; Inchul Hwang; Jaesik Choi

arXiv:2506.00832 · cs.SD · June 3, 2025

Open Access

TL;DR

This paper presents Counterfactual Activation Editing, a post-hoc method that controls prosody and corrects mispronunciations in pre-trained TTS models without retraining.

Contribution

The paper introduces a model-agnostic, inference-time technique that manipulates internal activations of a pre-trained TTS model to adjust prosody and correct mispronunciations post hoc.

Findings

01. Effective prosody adjustment demonstrated
02. Successful mispronunciation correction shown
03. Preserves overall speech quality

Abstract

Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without…
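The abstract says only that the method "manipulates internal representations" at inference time and does not spell out the procedure. As a rough illustration of what an activation edit of this general kind can look like, the sketch below assumes a linear probe that reads a prosodic feature (say, pitch) from a hidden activation, then computes the smallest L2 change to that activation that makes the probe output a chosen target. The probe, the toy numbers, and the names `probe` and `counterfactual_edit` are all hypothetical and are not taken from the paper.

```python
# Hypothetical illustration only: NOT the authors' implementation.
# Assumption: a linear probe w.h + b predicts a prosodic feature
# from one hidden activation vector h of a frozen TTS model.

def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))

def probe(hidden, weights, bias):
    """Assumed linear probe: predicted feature value for `hidden`."""
    return dot(weights, hidden) + bias

def counterfactual_edit(hidden, weights, bias, target):
    """Minimal-L2-norm edit of `hidden` so the probe outputs `target`.

    For a linear probe the closed form is
        h' = h + ((target - (w.h + b)) / (w.w)) * w
    i.e. move along the probe direction just far enough to close the gap.
    """
    gap = target - probe(hidden, weights, bias)
    scale = gap / dot(weights, weights)
    return [h + scale * w for h, w in zip(hidden, weights)]

# Toy values (made up for the example).
hidden = [0.5, -1.0, 2.0]   # activation captured at some layer
weights = [1.0, 0.0, 2.0]   # probe direction
bias = 0.5
target = 7.0                # desired feature value after editing

edited = counterfactual_edit(hidden, weights, bias, target)
print("before:", probe(hidden, weights, bias))
print("after: ", probe(edited, weights, bias))
```

In a real system the edited activation would replace the original one and the remaining layers of the frozen model would run on it; the model weights themselves stay untouched, which is what makes the adjustment post hoc. Components with zero probe weight (the second coordinate here) are left unchanged by this particular edit.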

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research