Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2

Suvendu Sekhar Mohanty

arXiv:2603.11683 · cs.SD · March 13, 2026

Open Access

TL;DR

This paper introduces a causal framework for expressive TTS that disentangles emotion from linguistic content, enabling controllable prosody manipulation and improved speech naturalness.

Contribution

The paper proposes a causal prosody mediation model for FastSpeech2, trained with counterfactual objectives, that disentangles emotional prosody from linguistic content to improve controllability and expressiveness.

Findings

01 Improved prosody manipulation and emotion rendering in TTS.

02 Higher MOS and emotion accuracy compared to baseline models.

03 Better intelligibility and speaker consistency in emotion transfer.

Abstract

We propose a novel causal prosody mediation framework for expressive text-to-speech (TTS) synthesis. Our approach augments the FastSpeech2 architecture with explicit emotion conditioning and introduces counterfactual training objectives to disentangle emotional prosody from linguistic content. By formulating a structural causal model of how text (content), emotion, and speaker jointly influence prosody (duration, pitch, energy) and ultimately the speech waveform, we derive two complementary loss terms: an Indirect Path Constraint (IPC) to enforce that emotion affects speech only through prosody, and a Counterfactual Prosody Constraint (CPC) to encourage distinct prosody patterns for different emotions. The resulting model is trained on multi-speaker emotional corpora (LibriTTS, EmoV-DB, VCTK) with a combined objective that includes standard spectrogram reconstruction and variance…
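
The two constraints named in the abstract can be illustrated with a toy sketch. This is not the paper's formulation, which the truncated abstract does not give: the function names `ipc_loss` and `cpc_loss`, the squared-error and hinge forms, the `margin` parameter, and the toy predictor and decoders are all assumptions made for illustration. The sketch treats IPC as "with prosody held fixed, changing the emotion input should not change the decoder output" and CPC as "prosody predicted under two different emotions should differ by at least a margin".

```python
import numpy as np

def ipc_loss(decode, hidden, prosody, emo_a, emo_b):
    """Hypothetical Indirect Path Constraint (assumed form, not the paper's).

    With prosody held fixed, swapping the emotion input should leave the
    decoder output unchanged, so any difference is penalised.
    """
    out_a = decode(hidden, prosody, emo_a)
    out_b = decode(hidden, prosody, emo_b)
    return float(np.mean((out_a - out_b) ** 2))

def cpc_loss(predict_prosody, hidden, emo_a, emo_b, margin=1.0):
    """Hypothetical Counterfactual Prosody Constraint (assumed hinge form).

    Penalises the model when prosody predicted for the same text under two
    emotions is closer than `margin`.
    """
    p_a = predict_prosody(hidden, emo_a)
    p_b = predict_prosody(hidden, emo_b)
    dist = float(np.linalg.norm(p_a - p_b))
    return max(0.0, margin - dist)

# Toy stand-ins for the model components (illustrative only).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4))      # 5 phonemes, 4-dim text encoding
emo_happy = np.array([1.0, 0.0])      # one-hot emotion codes
emo_sad = np.array([0.0, 1.0])
W = rng.normal(size=(2, 3))           # emotion -> (duration, pitch, energy) shift

def predict_prosody(h, emo):
    # Per-phoneme (duration, pitch, energy), shifted by the emotion code.
    return h[:, :3] + emo @ W

def mediated_decode(h, prosody, emo):
    # Ignores `emo`: emotion can reach the output only through prosody.
    return np.concatenate([h, prosody], axis=1)

def leaky_decode(h, prosody, emo):
    # Uses `emo` directly: a direct emotion -> speech path that IPC penalises.
    return np.concatenate([h, prosody], axis=1) + emo[0]

prosody = predict_prosody(hidden, emo_happy)
ipc_mediated = ipc_loss(mediated_decode, hidden, prosody, emo_happy, emo_sad)
ipc_leaky = ipc_loss(leaky_decode, hidden, prosody, emo_happy, emo_sad)
cpc_same = cpc_loss(predict_prosody, hidden, emo_happy, emo_happy)
cpc_diff = cpc_loss(predict_prosody, hidden, emo_happy, emo_sad)
```

In this toy setup the mediated decoder incurs no IPC penalty while the leaky one does, and the CPC penalty is largest when two emotions yield identical prosody. In a real training loop such terms would presumably be weighted and added to the reconstruction and variance losses the abstract mentions; the weighting is not specified in the text above.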

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Mental Health via Writing