Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

Takenori Yoshimura; Shinji Takaki; Kazuhiro Nakamura; Keiichiro Oura; Yukiya Hono; Kei Hashimoto; Yoshihiko Nankaku; Keiichi Tokuda

arXiv:2211.11222 · eess.AS · November 22, 2022 · 1 citation

TL;DR

This paper introduces a differentiable mel-cepstral synthesis filter integrated into a neural speech synthesis system, enabling end-to-end training and precise control over voice characteristics and pitch, resulting in improved speech quality.

Contribution

The paper presents a novel differentiable implementation of the mel-cepstral synthesis filter within a neural speech synthesis framework for end-to-end optimization and enhanced controllability.

Findings

1. Improved speech quality over baseline systems.
2. High controllability of voice characteristics and pitch.
3. Differentiable filter implementation is GPU-friendly.

Abstract

This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system toward end-to-end controllable speech synthesis. Because the mel-cepstral synthesis filter is explicitly embedded in the neural waveform models of the proposed system, both the voice characteristics and the pitch of synthesized speech are highly controllable via a frequency warping parameter and the fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable and GPU-friendly module so that the acoustic and waveform models in the proposed system can be optimized simultaneously in an end-to-end manner. Experiments show that the proposed system improves speech quality over a baseline system while maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub.
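To make the role of the frequency warping parameter more concrete, the sketch below evaluates the log-magnitude envelope implied by a set of mel-cepstral coefficients, using the standard first-order all-pass warping from mel-cepstral analysis. It is an illustration only, not the paper's filter or the diffsptk API: the function names `warp_frequency` and `log_magnitude_envelope`, the symbol `alpha`, and the example coefficients are all assumptions introduced here, and the code uses plain Python rather than PyTorch.

```python
import math


def warp_frequency(omega, alpha):
    """Warped frequency of the first-order all-pass filter.

    Assumes |alpha| < 1. With alpha = 0 the frequency axis is unchanged;
    alpha > 0 stretches the low-frequency region (mel-like warping).
    """
    return omega + 2.0 * math.atan2(
        alpha * math.sin(omega), 1.0 - alpha * math.cos(omega)
    )


def log_magnitude_envelope(mel_cepstrum, alpha, n_bins=9):
    """Evaluate log|H(e^{j omega})| at n_bins frequencies in [0, pi].

    The envelope is the cosine series sum_m c(m) * cos(m * warped_omega),
    where c(m) are the (hypothetical) mel-cepstral coefficients.
    """
    envelope = []
    for k in range(n_bins):
        omega = math.pi * k / (n_bins - 1)
        warped = warp_frequency(omega, alpha)
        envelope.append(
            sum(c * math.cos(m * warped) for m, c in enumerate(mel_cepstrum))
        )
    return envelope


if __name__ == "__main__":
    coefficients = [0.0, 1.0, 0.5]  # illustrative values, not from the paper
    for alpha in (0.0, 0.42, 0.55):
        env = log_magnitude_envelope(coefficients, alpha)
        print(alpha, [round(v, 3) for v in env])
```

Running the script with the same coefficients and different values of `alpha` shows the envelope sliding along the frequency axis, which is the kind of effect the abstract refers to when it says voice characteristics are controlled via the warping parameter. Every operation here is a smooth elementary function, so the same expressions written with tensor operations would admit gradients, though the module described in the paper is a time-domain synthesis filter and is considerably more involved.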

Code & Models

Repositories

sp-nitech/diffsptk (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing