On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Study
Hyouin Liu, Zhikuan Zhang

TL;DR
This study compares context-based utterance-level training with full conversation training for conversational TTS, finding that utterance-level training yields higher quality and shorter training time, while full conversation training suffers from speaker similarity hallucination.
Contribution
It provides empirical evidence favoring utterance-level training over full conversation training for resource-efficient and high-quality conversational TTS.
Findings
Context-based utterance-level training achieves a higher MOS (4.3/5.0 vs. 3.7/5.0 for full conversation training).
Utterance-level training reduces training time by 37%.
Full conversation training suffers from speaker similarity hallucination.
Abstract
Modern TTS systems designed for conversations achieve high-quality utterances but often remain inaccessible publicly. Are existing open-source architectures inadequate, or are current training techniques insufficient? This paper investigates prominent models and their underlying behaviors regarding conversational context. Using 20 GPU-hours on an NVIDIA H100, we empirically examine two approaches: context-based utterance-level training versus full conversation training. Results demonstrate that context-based utterance training achieves superior MOS scores (4.3/5.0 vs 3.7/5.0) and reduces training time by 37%, while full conversation approaches suffer from speaker similarity hallucination issues. These findings provide practical guidelines for conversational TTS development, favoring utterance-level training with contextual conditioning for both resource efficiency and output quality.
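The abstract does not spell out how training examples are built under each approach. As a rough illustration only, the sketch below shows one plausible way to derive the two kinds of examples from the same conversation. The `Turn` and `Example` types, the function names, and the `max_context` parameter are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class Example:
    context: List[Turn]  # preceding turns used only as conditioning (assumption)
    targets: List[Turn]  # turns the model would be trained to synthesize


def utterance_level_examples(conversation: List[Turn], max_context: int = 3) -> List[Example]:
    """One example per utterance, conditioned on up to max_context earlier turns."""
    examples = []
    for i, turn in enumerate(conversation):
        start = max(0, i - max_context)
        examples.append(Example(context=conversation[start:i], targets=[turn]))
    return examples


def full_conversation_example(conversation: List[Turn]) -> Example:
    """One example whose target is the whole conversation, with no separate context."""
    return Example(context=[], targets=list(conversation))


# Hypothetical three-turn conversation, for illustration only.
conversation = [
    Turn("A", "Hi, how are you?"),
    Turn("B", "Good, thanks."),
    Turn("A", "Glad to hear it."),
]
utt_examples = utterance_level_examples(conversation, max_context=2)
full_example = full_conversation_example(conversation)
```

Under this framing, utterance-level training yields many short examples with a single target speaker each, whereas full conversation training yields one long example in which the model must keep several speakers apart. That difference is one possible reading of why speaker similarity could degrade in the second setting, though the abstract does not state the mechanism.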
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Emotion and Mood Recognition
