Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System
Takenori Yoshimura, Shinji Takaki, Kazuhiro Nakamura, Keiichiro Oura, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

TL;DR
This paper introduces a differentiable mel-cepstral synthesis filter integrated into a neural speech synthesis system, enabling end-to-end training and precise control over voice characteristics and pitch, resulting in improved speech quality.
Contribution
The paper presents a novel differentiable implementation of the mel-cepstral synthesis filter within a neural speech synthesis framework for end-to-end optimization and enhanced controllability.
Findings
Improved speech quality over baseline systems.
High controllability of voice characteristics and pitch.
Differentiable filter implementation is GPU-friendly.
Abstract
This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system, working towards end-to-end controllable speech synthesis. Because the mel-cepstral synthesis filter is explicitly embedded in the neural waveform models of the proposed system, the voice characteristics and the pitch of the synthesized speech can be controlled closely via a frequency warping parameter and the fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable and GPU-friendly module so that the acoustic and waveform models in the proposed system can be optimized simultaneously in an end-to-end manner. Experiments show that the proposed system improves speech quality over a baseline system while maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub.
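To make the role of the frequency warping parameter concrete, here is a minimal sketch of the standard mel-cepstral relations, written in plain Python rather than PyTorch. It is not the paper's differentiable module. It only evaluates the textbook definition, in which the filter is H(z) = exp(sum over m of c(m) * z~^(-m)) and z~^(-1) is a first-order all-pass function with parameter alpha. The function names `warp_frequency` and `log_magnitude`, and the coefficient values in the usage lines, are hypothetical and chosen for illustration.

```python
import math


def warp_frequency(omega, alpha):
    """Warped frequency: the phase response of the first-order all-pass
    function (z^-1 - alpha) / (1 - alpha * z^-1) at angular frequency omega."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def log_magnitude(mel_cepstrum, omega, alpha):
    """Log-magnitude response of the mel-cepstral filter at omega:
    log|H(e^{j omega})| = sum_m c(m) * cos(m * warped_omega)."""
    warped = warp_frequency(omega, alpha)
    return sum(c * math.cos(m * warped) for m, c in enumerate(mel_cepstrum))


# Illustrative values only (assumptions, not taken from the paper).
coeffs = [0.2, 0.5, -0.1]
for alpha in (0.0, 0.42):
    response = [log_magnitude(coeffs, k * math.pi / 4, alpha) for k in range(5)]
    print(alpha, [round(v, 3) for v in response])
```

With alpha = 0 the warping is the identity and the coefficients behave as an ordinary cepstrum. A positive alpha stretches the low-frequency region, so the same coefficients describe a different spectral envelope, which is the mechanism behind changing voice characteristics through the warping parameter.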
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
