High quality voice conversion using prosodic and high-resolution spectral features

Hy Quy Nguyen; Siu Wa Lee; Xiaohai Tian; Minghui Dong; Eng Siong Chng

arXiv:1512.01809·cs.SD·December 8, 2015

TL;DR

This paper presents a deep neural network framework that converts both spectral and prosodic features for high-quality voice conversion, utilizing autoencoder pretraining and segmental models to improve speech naturalness.

Contribution

The work introduces a novel DNN-based voice conversion method that jointly models high-resolution spectral and prosodic features with autoencoder pretraining and segmental prosody modeling.

Findings

01. Enhanced speech quality in objective evaluations

02. Improved naturalness in subjective listening tests

03. Effective modeling of spectral and prosodic features

Abstract

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral features as well as various prosodic features. Most existing conversion methods focus on the spectral feature, as it directly represents the timbre characteristics, while some conversion methods have focused only on the prosodic feature represented by the fundamental frequency (F0). In this paper, a comprehensive framework using deep neural networks (DNNs) to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution spectral feature. The prosodic features include F0, intensity, and duration. DNNs are well known to be useful for modeling high-dimensional features. In this work, we show that a DNN initialized by our proposed autoencoder pretraining yields good-quality conversion models. This…
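The abstract does not spell out the pretraining scheme, so the following is only a rough sketch of one common way autoencoder pretraining can initialize a conversion network, not the authors' method. It assumes one small autoencoder is trained per speaker, and that the source encoder and target decoder then seed the first and last layers of the conversion network, with an identity-initialized layer in between. The function names (`train_autoencoder`, `init_conversion_net`, `convert`), the layer sizes, and the random stand-in features are all hypothetical.

```python
import numpy as np

def train_autoencoder(X, hidden, epochs=300, lr=0.1, seed=0):
    """Train a one-hidden-layer tanh autoencoder with plain gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0.0, 0.1, (d, hidden))
    b_enc = np.zeros(hidden)
    W_dec = rng.normal(0.0, 0.1, (hidden, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W_enc + b_enc)       # encode
        err = (H @ W_dec + b_dec) - X        # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / (n * d)          # gradient of the MSE w.r.t. the output
        g_hid = (g_out @ W_dec.T) * (1.0 - H ** 2)  # back through tanh
        W_dec -= lr * (H.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_hid)
        b_enc -= lr * g_hid.sum(axis=0)
    return (W_enc, b_enc, W_dec, b_dec), losses

def init_conversion_net(src_ae, tgt_ae):
    """Assumed scheme: source encoder -> identity mapping -> target decoder."""
    W_enc, b_enc, _, _ = src_ae
    _, _, W_dec, b_dec = tgt_ae
    hidden = W_enc.shape[1]
    return {
        "W1": W_enc.copy(), "b1": b_enc.copy(),
        "W2": np.eye(hidden), "b2": np.zeros(hidden),
        "W3": W_dec.copy(), "b3": b_dec.copy(),
    }

def convert(net, X):
    """Forward pass of the (not yet fine-tuned) conversion network."""
    H = np.tanh(X @ net["W1"] + net["b1"])
    H = H @ net["W2"] + net["b2"]
    return H @ net["W3"] + net["b3"]

# Synthetic stand-ins for per-frame spectral features (not real speech data).
rng = np.random.default_rng(1)
X_src = rng.normal(size=(200, 8))
X_tgt = rng.normal(size=(200, 8))

src_ae, src_losses = train_autoencoder(X_src, hidden=4, seed=0)
tgt_ae, tgt_losses = train_autoencoder(X_tgt, hidden=4, seed=1)
net = init_conversion_net(src_ae, tgt_ae)
Y = convert(net, X_src)
```

In a full system the initialized network would then be fine-tuned with backpropagation on time-aligned source and target frames. That step is omitted here to keep the sketch short.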
