HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

Sang-Hoon Lee; Ha-Yeong Choi; Hyung-Seok Oh; Seong-Whan Lee

arXiv:2307.16171 · cs.SD · August 1, 2023

Open Access

TL;DR

HierVST is a hierarchical adaptive zero-shot voice style transfer model trained on speech data alone, without text transcripts. It uses hierarchical variational inference and self-supervised speech representations to adapt to unseen speakers and to convert speech progressively.

Contribution

The paper proposes a hierarchical adaptive end-to-end zero-shot VST model that is trained without text transcripts and improves style transfer to unseen speakers.

Findings

01 Outperforms existing VST models in zero-shot scenarios

02 Effectively adapts to novel voice styles

03 Progressively converts speech with hierarchical structure

Abstract

Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST…
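
As a rough illustration of the sequential data flow the abstract describes (a pitch representation is generated first, then the waveform, with both stages conditioned on the target voice style), here is a toy NumPy sketch. It is not the authors' architecture: the single-layer form, the scale-and-shift style conditioning, every dimension, and the names `init_params`, `style_adaptive_layer`, and `convert` are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np


def init_params(content_dim, style_dim, pitch_dim, hop, seed=0):
    """Create random weights for the two toy stages (hypothetical shapes)."""
    rng = np.random.default_rng(seed)

    def layer(in_dim, out_dim):
        # (projection, style-to-scale, style-to-shift) weight matrices
        return (
            rng.standard_normal((in_dim, out_dim)) * 0.1,
            rng.standard_normal((style_dim, out_dim)) * 0.1,
            rng.standard_normal((style_dim, out_dim)) * 0.1,
        )

    return {
        "pitch": layer(content_dim, pitch_dim),
        "wave": layer(content_dim + pitch_dim, hop),
    }


def style_adaptive_layer(x, style, w, w_scale, w_shift):
    """Project x, then scale and shift each channel using the style vector.

    This scale-and-shift conditioning is an assumed stand-in for whatever
    adaptation mechanism the actual model uses.
    """
    h = x @ w                      # (frames, out_dim)
    scale = 1.0 + style @ w_scale  # (out_dim,), broadcast over frames
    shift = style @ w_shift        # (out_dim,)
    return np.tanh(h * scale + shift)


def convert(content, style, params):
    """Two-stage conversion: pitch representation first, waveform second."""
    # Stage 1: style-conditioned pitch representation from content features.
    pitch = style_adaptive_layer(content, style, *params["pitch"])
    # Stage 2: waveform samples from content plus the generated pitch,
    # again conditioned on the same style vector.
    stage2_in = np.concatenate([content, pitch], axis=-1)
    frames = style_adaptive_layer(stage2_in, style, *params["wave"])
    # Each frame yields `hop` samples; flatten to a 1-D waveform.
    return pitch, frames.reshape(-1)


# Toy usage: 10 frames of 8-dim content features, 6-dim style vector.
rng = np.random.default_rng(1)
content = rng.standard_normal((10, 8))
target_style = rng.standard_normal(6)
params = init_params(content_dim=8, style_dim=6, pitch_dim=4, hop=16)
pitch, waveform = convert(content, target_style, params)
```

The point of the sketch is only the ordering: the second stage consumes the output of the first, so a change of target style propagates through the pitch representation into the waveform.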

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Variational Inference