PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS

Junhyeok Lee; Wonbin Jung; Hyunjae Cho; Jaeyeon Kim; Jaehwan Kim

arXiv:2302.12391·eess.AS·June 7, 2023·1 cites

TL;DR

PITS is an end-to-end TTS model that uses variational inference to model pitch instead of directly modeling fundamental frequency, an approach the authors say leads to low variance in synthesized speech. It aims for high-quality, pitch-controllable synthesis.

Contribution

The paper presents a variational pitch inference method built on VITS, adding a Yingram encoder, a Yingram decoder, and adversarial training of pitch-shifted synthesis to make the model pitch-controllable.

Findings

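01 Generated speech is reported to be indistinguishable from ground-truth speech.

02 Pitch can be controlled to a high degree without degrading quality.

03 The model targets the low variance in synthesized speech that the authors attribute to directly modeling fundamental frequency.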

Abstract

Previous pitch-controllable text-to-speech (TTS) models rely on directly modeling fundamental frequency, leading to low variance in synthesized speech. To address this issue, we propose PITS, an end-to-end pitch-controllable TTS model that utilizes variational inference to model pitch. Based on VITS, PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch-controllability. Experiments demonstrate that PITS generates high-quality speech that is indistinguishable from ground truth speech and has high pitch-controllability without quality degradation. Code, audio samples, and demo are available at https://github.com/anonymous-pits/pits.
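As a small illustration of what a pitch shift means in frequency terms, the sketch below converts a shift in semitones to a frequency ratio under 12-tone equal temperament. It is not taken from the PITS code and does not reflect how PITS represents pitch (the model avoids explicit fundamental frequency); the function names are hypothetical and chosen only for this example.

```python
def semitone_shift_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shift_frequency(freq_hz: float, semitones: float) -> float:
    """Frequency in Hz after shifting `freq_hz` by `semitones` (hypothetical helper)."""
    return freq_hz * semitone_shift_to_ratio(semitones)


if __name__ == "__main__":
    # One octave down, unchanged, and one octave up from 220 Hz.
    for s in (-12, 0, 12):
        print(s, shift_frequency(220.0, s))
```

A shift of +12 semitones doubles the frequency and a shift of -12 halves it, which is the scale on which pitch-shift amounts are usually expressed.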

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Variational Inference