Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Zhengyang Chen; Xuechen Liu; Erica Cooper; Junichi Yamagishi; Yanmin Qian

arXiv:2406.08812·cs.SD·June 14, 2024

Open Access

TL;DR

This paper introduces a flexible multi-speaker TTS system that uses listener impressions as prompts to control speaker traits, leveraging LoRA for quick adaptation and combining discriminative and generative methods for improved speech fidelity.

Contribution

It presents a novel prompt-based speaker control method that separates prompt processing from the TTS system, enhancing flexibility and naturalness in speaker trait specification.

Findings

1. Listener impressions effectively guide speaker trait control.

2. Combining discriminative and generative methods improves speech fidelity.

3. LoRA allows the pre-trained language model to be adapted quickly.

Abstract

This paper proposes a speech synthesis system that lets users specify and control the acoustic characteristics of a synthesized speaker through prompts describing the speaker's traits. Unlike previous approaches, our method constructs prompts from listener impressions, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We adopt Low-Rank Adaptation (LoRA) to quickly tailor a pre-trained language model to this task, facilitating the extraction of speaker-related traits from the prompt text. In addition, unlike other prompt-driven text-to-speech (TTS) systems, we separate the prompt-to-speaker module from the multi-speaker TTS system, which improves flexibility and compatibility with various pre-trained multi-speaker TTS systems. Moreover, for the prompt-to-speaker characteristic module, we also…
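The abstract names LoRA as the way the pre-trained language model is adapted. As a rough illustration of the general idea only (not the authors' implementation), the sketch below wraps a frozen weight matrix with a trainable low-rank update in pure Python. The class name `LoRALinear`, the helper `matmul`, and the parameters `rank`, `alpha`, and `seed` are hypothetical names chosen for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update scale * B @ A.

    Illustrative sketch: only A and B would be trained, W stays fixed.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen pre-trained weight, shape (d_out, d_in)
        d_out, d_in = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the pre-trained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """Apply the adapted layer to a vector x of length d_in."""
        col = [[v] for v in x]
        base = matmul(self.weight, col)
        delta = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * d[0] for b, d in zip(base, delta)]

    def trainable_params(self):
        """Number of parameters in the low-rank factors A and B."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a 4×6 weight matrix and `rank=2`, the factors hold 20 trainable values instead of the 24 in the full matrix; the saving grows quickly with layer size, which is what makes this kind of adaptation fast.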

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques