On Using Backpropagation for Speech Texture Generation and Voice Conversion

Jan Chorowski; Ron J. Weiss; Rif A. Saurous; Samy Bengio

arXiv:1712.08363·cs.SD·March 9, 2018

TL;DR

This paper presents a proof-of-concept system for speech texture synthesis and voice conversion. It optimizes the input waveform by backpropagation, combining approximate inversion of a speech recognition network with matching of neuron activation statistics, and needs as little as a few seconds of target speaker data.

Contribution

The authors show that backpropagation towards the network inputs, the mechanism behind image texture synthesis and neural style transfer, can be applied to speech texture synthesis and voice conversion by approximately inverting a speech recognition network.

Findings

01

Extraction of speaker characteristics from as little as a few seconds of target speaker data

02

Ability to generate speech babble and reconstruct utterances in different voices

03

Use of a differentiable mel-filterbank feature extraction pipeline, so that the cost function can be optimized with respect to the waveform samples
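
To make the third finding concrete, here is a minimal NumPy sketch of a log-mel filterbank front end built only from differentiable operations (framing, windowing, DFT, squared magnitude, a matrix multiply, and a logarithm). It is an illustration, not the authors' pipeline: the function names and the settings (16 kHz sample rate, 400-sample Hann window, 160-sample hop, 40 mel bands) are assumptions made for this example. NumPy does not compute gradients itself; the point is that the same chain of operations could be written in an autodiff framework and backpropagated through to the waveform.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_matrix(n_mels, n_fft, sr):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    lo, mid, hi = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs - lo) / (mid - lo)
    falling = (hi - freqs) / (hi - mid)
    return np.maximum(0.0, np.minimum(rising, falling))

def log_mel(x, sr=16000, n_fft=400, hop=160, n_mels=40, eps=1e-6):
    """Log-mel features of waveform x, shape (n_frames, n_mels).

    All settings are illustrative assumptions, not values from the paper.
    """
    n_frames = 1 + (len(x) - n_fft) // hop
    # Sample indices of every frame, shape (n_frames, n_fft).
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power @ mel_matrix(n_mels, n_fft, sr).T + eps)

# Example: one second of a 1 kHz tone sampled at 16 kHz.
tone = np.sin(2.0 * np.pi * 1000.0 * np.arange(16000) / 16000.0)
features = log_mel(tone)
```

The small constant `eps` keeps the logarithm finite on silent frames, which also keeps its derivative bounded.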

Abstract

Inspired by recent work on neural network image generation which relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching statistics of neuron activations between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or…
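
The optimization described in the abstract, a cost on activation statistics minimized with respect to the waveform samples, can be sketched on a toy scale. The example below is a hedged illustration under stated assumptions, not the authors' system: a single layer of fixed random filters with a ReLU stands in for the trained CTC recognition network, the Gram matrix of the activations (the statistic used in image texture synthesis) stands in for the matched statistics, and the gradient is derived by hand in NumPy with a simple backtracking step size. All function names are hypothetical.

```python
import numpy as np

def activations(x, W):
    """Slide filters W (C, K) over waveform x, then apply ReLU."""
    K = W.shape[1]
    N = x.shape[0] - K + 1
    # Sample indices of each length-K frame (hop of 1), shape (K, N).
    idx = np.arange(K)[:, None] + np.arange(N)[None, :]
    Z = W @ x[idx]                      # pre-activations, shape (C, N)
    return np.maximum(Z, 0.0), Z, idx

def gram(A):
    """Time-averaged channel correlations, shape (C, C)."""
    return A @ A.T / A.shape[1]

def loss_and_grad(x, W, G_target):
    """Squared Gram-matrix mismatch and its gradient w.r.t. the waveform."""
    A, Z, idx = activations(x, W)
    diff = gram(A) - G_target
    loss = float(np.sum(diff ** 2))
    grad_A = 4.0 * diff @ A / A.shape[1]   # uses the symmetry of diff
    grad_Z = grad_A * (Z > 0)              # backprop through ReLU
    grad_frames = W.T @ grad_Z             # backprop through the filters
    grad_x = np.zeros_like(x)
    np.add.at(grad_x, idx, grad_frames)    # overlapping frames accumulate
    return loss, grad_x

def synthesize(target, W, steps=200, seed=0):
    """Start from noise and descend on the waveform samples themselves."""
    rng = np.random.default_rng(seed)
    G_target = gram(activations(target, W)[0])
    x = 0.1 * rng.standard_normal(target.shape[0])
    loss, grad = loss_and_grad(x, W, G_target)
    history, step = [loss], 1e-2
    for _ in range(steps):
        cand = x - step * grad
        cand_loss, cand_grad = loss_and_grad(cand, W, G_target)
        if cand_loss < loss:               # accept only improving steps
            x, loss, grad = cand, cand_loss, cand_grad
            step *= 1.2
        else:
            step *= 0.5
        history.append(loss)
    return x, history

# Example: match the activation statistics of a two-tone target signal.
demo_rng = np.random.default_rng(1)
filters = demo_rng.standard_normal((8, 16)) / 4.0   # stand-in "network"
n = np.arange(256)
target_wave = np.sin(0.2 * n) + 0.5 * np.sin(0.05 * n)
texture, losses = synthesize(target_wave, filters)
```

Because the Gram matrix averages over time, the result shares the target's filter-response statistics without copying its sample-by-sample alignment, which is the sense in which such a procedure generates a "texture" rather than a reconstruction.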
