Controllable Speaking Styles Using a Large Language Model

Atli Thor Sigurgeirsson; Simon King

arXiv:2305.10321 · cs.CL · September 20, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a novel approach using large language models to directly suggest prosodic modifications for controllable TTS, enabling style and dialogue-appropriate prosody without reference utterances.

Contribution

It proposes leveraging LLMs to guide prosodic control in TTS directly from prompts, bypassing the need for reference utterances or prompt-labelled speech training.

Findings

01

The proposed method was rated most appropriate in 50% of cases, versus 31% for the baseline.

02

Demonstrated control of speaking style and dialogue-appropriate prosody.

03

Effective prompt-based control without reference utterances.

Abstract

Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text. Such models jointly learn a latent acoustic space during training, which can be sampled from during inference. Controlling these models during inference typically requires finding an appropriate reference utterance, which is non-trivial. Large generative language models (LLMs) have shown excellent performance in various language-related tasks. Given only a natural language query text (the prompt), such models can be used to solve specific, context-dependent tasks. Recent work in TTS has attempted similar prompt-based control of novel speaking style generation. Those methods do not require a reference utterance and can, under ideal conditions, be controlled with only a prompt. But existing methods typically require a prompt-labelled speech corpus for jointly…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling