Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2
Suvendu Sekhar Mohanty

TL;DR
This paper introduces a causal framework for expressive TTS that disentangles emotion from linguistic content, enabling controllable prosody manipulation and improved speech naturalness.
Contribution
It proposes a novel causal prosody mediation model with counterfactual training for disentangling emotion and prosody in TTS, enhancing controllability and expressiveness.
Findings
Improved prosody manipulation and emotion rendering in TTS.
Higher mean opinion score (MOS) and emotion accuracy than baseline models.
Better intelligibility and speaker consistency in emotion transfer.
Abstract
We propose a novel causal prosody mediation framework for expressive text-to-speech (TTS) synthesis. Our approach augments the FastSpeech2 architecture with explicit emotion conditioning and introduces counterfactual training objectives to disentangle emotional prosody from linguistic content. By formulating a structural causal model of how text (content), emotion, and speaker jointly influence prosody (duration, pitch, energy) and ultimately the speech waveform, we derive two complementary loss terms: an Indirect Path Constraint (IPC) to enforce that emotion affects speech only through prosody, and a Counterfactual Prosody Constraint (CPC) to encourage distinct prosody patterns for different emotions. The resulting model is trained on multi-speaker emotional corpora (LibriTTS, EmoV-DB, VCTK) with a combined objective that includes standard spectrogram reconstruction and variance…
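The abstract names the two counterfactual loss terms but does not give their functional form, so the following is only a rough sketch of how such terms might be combined, not the paper's formulation. It assumes the IPC can be approximated by penalising any change in decoder output when the emotion input is swapped while prosody is held fixed, and the CPC by a hinge margin that pushes prosody predicted for two different emotions apart. The function names, the `decode` callable, the margin, and the weights `w_ipc` and `w_cpc` are all hypothetical.

```python
import numpy as np

def cpc_loss(prosody_a, prosody_b, margin=1.0):
    """Assumed CPC form: hinge penalty that reaches zero once the prosody
    predicted for two different emotions differs by at least `margin`
    (mean absolute difference over frames and prosody features)."""
    gap = float(np.mean(np.abs(prosody_a - prosody_b)))
    return max(0.0, margin - gap)

def ipc_loss(decode, hidden, prosody, emotion_a, emotion_b):
    """Assumed IPC form: with prosody held fixed, swapping the emotion input
    should leave the decoder output unchanged, so any difference is penalised."""
    out_a = decode(hidden, prosody, emotion_a)
    out_b = decode(hidden, prosody, emotion_b)
    return float(np.mean((out_a - out_b) ** 2))

def total_loss(recon, variance, ipc, cpc, w_ipc=1.0, w_cpc=1.0):
    """Combined objective: reconstruction and variance-predictor losses
    plus the two weighted counterfactual terms (weights are placeholders)."""
    return recon + variance + w_ipc * ipc + w_cpc * cpc

# Toy usage: hidden states (frames x dims) and prosody (duration, pitch, energy per frame).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4))
prosody = rng.normal(size=(5, 3))
proj = rng.normal(size=(3, 4))
happy, sad = np.ones(4), -np.ones(4)

def mediated_decoder(h, p, e):
    # Emotion is ignored here: it can reach the output only through prosody.
    return h + p @ proj

def leaky_decoder(h, p, e):
    # Emotion is added directly, bypassing prosody (the path IPC discourages).
    return h + p @ proj + e

print("IPC, mediated decoder:", ipc_loss(mediated_decoder, hidden, prosody, happy, sad))
print("IPC, leaky decoder:", ipc_loss(leaky_decoder, hidden, prosody, happy, sad))
```

Under these assumptions, a decoder that reads emotion only through prosody incurs no IPC penalty, while one with a direct emotion path does. In a real FastSpeech2 setup the same idea would be expressed with differentiable tensor operations on the variance adaptor and decoder outputs.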
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Mental Health via Writing
