A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer
Himanshu Maurya, Atli Sigurgeirsson

TL;DR
This paper introduces a human-in-the-loop method to enhance cross-text prosody transfer in TTS systems: users adjust the transferred prosody so that it better suits the target text, leading to more natural speech synthesis.
Contribution
The paper presents a novel human-in-the-loop approach that improves cross-text prosody transfer by incorporating user adjustments, addressing limitations of existing models.
Findings
Human-adjusted renditions are rated as more appropriate for the target text 57.8% of the time
Limited user effort yields significant prosody improvements
Latent reference space closeness is unreliable for prosodic similarity
Abstract
Text-To-Speech (TTS) prosody transfer models can generate varied prosodic renditions of the same text by conditioning on a reference utterance. These models are trained with a reference that is identical to the target utterance. But when the reference utterance differs from the target text, as in cross-text prosody transfer, these models struggle to separate prosody from text, resulting in reduced perceived naturalness. To address this, we propose a Human-in-the-Loop (HitL) approach. HitL users adjust salient correlates of prosody to make the prosody more appropriate for the target text, while maintaining the overall reference prosodic effect. Human-adjusted renditions maintain the reference prosody while being rated as more appropriate for the target text 57.8% of the time. Our analysis suggests that limited user effort suffices for these improvements, and that closeness in the…
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Phonetics and Phonology Research
