Controllable Speaking Styles Using a Large Language Model
Atli Thor Sigurgeirsson, Simon King

TL;DR
This paper introduces a novel approach using large language models to directly suggest prosodic modifications for controllable TTS, enabling style and dialogue-appropriate prosody without reference utterances.
Contribution
It proposes leveraging LLMs to guide prosodic control in TTS directly from prompts, bypassing the need for reference utterances or prompt-labelled speech training.
Findings
The proposed method was rated most appropriate in 50% of cases, versus 31% for the baseline.
Demonstrated control of speaking style and dialogue-appropriate prosody.
Effective prompt-based control without reference utterances.
Abstract
Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text. Such models jointly learn a latent acoustic space during training, which can be sampled from during inference. Controlling these models during inference typically requires finding an appropriate reference utterance, which is non-trivial. Large generative language models (LLMs) have shown excellent performance in various language-related tasks. Given only a natural language query text (the prompt), such models can be used to solve specific, context-dependent tasks. Recent work in TTS has attempted similar prompt-based control of novel speaking style generation. Those methods do not require a reference utterance and can, under ideal conditions, be controlled with only a prompt. But existing methods typically require a prompt-labelled speech corpus for jointly…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
