FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

Taejun Bak; Jae-Sung Bae; Hanbin Bae; Young-Ik Kim; Hoon-Young Cho

arXiv:2106.15123·eess.AS·June 30, 2021

TL;DR

FastPitchFormant introduces a source-filter based neural TTS model that improves prosody control and speech quality by separately modeling text and acoustic features, reducing pitch-shift artifacts and speaker deformation.

Contribution

It presents a feed-forward Transformer TTS model, designed according to source-filter theory, that handles text and acoustic features in parallel to improve prosody control and speech quality.

Findings

01. Reduces audio quality degradation in large pitch-shift synthesis.

02. Mitigates speaker characteristic deformation during prosody modification.

03. Separately models text and acoustic features for better control.

Abstract

Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features. However, speech synthesized with a large pitch-shift scale suffers from audio quality degradation and deformation of speaker characteristics. To address this problem, we propose a feed-forward Transformer based TTS model designed according to the source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. By modeling each feature separately, the model's tendency to learn the relationship between the two features can be mitigated.
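
For readers unfamiliar with the source-filter theory that the abstract refers to, the classical signal-processing view can be sketched as follows. This is not the FastPitchFormant architecture, which is a neural model. It is only a minimal illustration of the underlying idea: a pitch-dependent excitation (source) is shaped by a pitch-independent resonance (filter), so changing the pitch leaves the filter untouched. All function names and the numeric values (700 Hz formant, 100 Hz bandwidth, 16 kHz sample rate) are illustrative assumptions, not taken from the paper.

```python
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Source: one unit impulse per pitch period (a crude glottal excitation)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator_coeffs(formant_hz, bandwidth_hz, sample_rate):
    """Filter: feedback coefficients of a two-pole resonator for one formant.

    Note that the pitch (f0) does not appear here at all.
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius (< 1, stable)
    theta = 2.0 * math.pi * formant_hz / sample_rate      # pole angle
    return 2.0 * r * math.cos(theta), -r * r


def apply_resonator(excitation, a1, a2):
    """Run y[n] = x[n] + a1 * y[n-1] + a2 * y[n-2] over the excitation."""
    y1 = y2 = 0.0
    out = []
    for x in excitation:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(f0_hz, formant_hz=700.0, bandwidth_hz=100.0,
               sample_rate=16000, n_samples=1600):
    """Combine source and filter; return the waveform and the filter coefficients."""
    source = impulse_train(f0_hz, sample_rate, n_samples)
    a1, a2 = resonator_coeffs(formant_hz, bandwidth_hz, sample_rate)
    return apply_resonator(source, a1, a2), (a1, a2)


# Shift the pitch by an octave: the source changes, the filter does not.
wave_low, filt_low = synthesize(100.0)
wave_high, filt_high = synthesize(200.0)
```

In this toy setting the two waveforms differ while `filt_low` and `filt_high` are identical, which is the kind of separation between pitch-related and text-related (formant) information that the abstract describes the model as aiming for.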

Code & Models

Repositories

keonlee9420/FastPitchFormant
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dropout · Label Smoothing