On Using Backpropagation for Speech Texture Generation and Voice Conversion
Jan Chorowski, Ron J. Weiss, Rif A. Saurous, Samy Bengio

TL;DR
This paper presents a proof-of-concept neural method for speech texture synthesis and voice conversion. It optimizes the input waveform by backpropagation, combining approximate inversion of a speech recognition network with matching of neuron activation statistics, and needs as little as a few seconds of target speaker data.
Contribution
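The authors show that backpropagation towards the network input, the mechanism behind image texture synthesis and neural style transfer, can be applied to speech: approximately inverting a speech recognition network and matching activation statistics yields speech texture synthesis and voice conversion.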
Findings
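Speaker characteristics can be extracted from as little as a few seconds of target speaker data
The system can generate realistic speech babble and reconstruct an utterance in a different voice
A differentiable mel-filterbank feature extraction pipeline lets the cost function be optimized directly with respect to the input waveform samples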
Abstract
Inspired by recent work on neural network image generation which relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching statistics of neuron activations between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or…
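The idea of matching activation statistics by optimizing the input waveform can be illustrated with a minimal sketch. It is not the authors' system, and everything in it is an assumption made for illustration: a single random linear + ReLU layer over non-overlapping frames stands in for the trained convolutional CTC network, there is no mel-filterbank front end, the statistic is a time-averaged Gram matrix of activations (as in image texture synthesis), and the names FRAME, CHANNELS, W, gram, loss_and_grad and synthesize are hypothetical. The gradient with respect to the waveform is written out by hand so that only NumPy is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME = 64      # samples per non-overlapping frame (illustrative choice)
CHANNELS = 16   # number of "neurons" in the toy layer (illustrative choice)

# Fixed random weights standing in for one layer of a trained network.
W = rng.standard_normal((FRAME, CHANNELS)) / np.sqrt(FRAME)

def activations(x):
    """Frame the waveform, then apply one linear + ReLU layer."""
    frames = x.reshape(-1, FRAME)
    pre = frames @ W
    return pre, np.maximum(pre, 0.0)

def gram(act):
    """Time-averaged channel co-activations, shape (CHANNELS, CHANNELS)."""
    return act.T @ act / act.shape[0]

def loss_and_grad(x, target_gram):
    """Squared Gram-matrix mismatch and its gradient w.r.t. the waveform."""
    pre, act = activations(x)
    diff = gram(act) - target_gram
    loss = np.sum(diff ** 2)
    # Backpropagate: loss -> Gram matrix -> activations -> ReLU -> frames.
    d_act = 2.0 * act @ (diff + diff.T) / act.shape[0]
    d_pre = d_act * (pre > 0)
    d_frames = d_pre @ W.T
    return loss, d_frames.reshape(-1)

def synthesize(target, steps=200, seed=1):
    """Gradient descent on the waveform with a backtracking step size."""
    target_gram = gram(activations(target)[1])
    x = 0.1 * np.random.default_rng(seed).standard_normal(target.shape)
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(x, target_gram)
        losses.append(loss)
        step = 1.0
        while step > 1e-8:
            candidate = x - step * grad
            if loss_and_grad(candidate, target_gram)[0] < loss:
                x = candidate  # accept only steps that lower the loss
                break
            step *= 0.5
    losses.append(loss_and_grad(x, target_gram)[0])
    return x, losses

# Toy "target utterance": a 220 Hz tone plus noise, 50 frames long.
t = np.arange(50 * FRAME) / 16000.0
target = np.sin(2 * np.pi * 220.0 * t) + 0.5 * rng.standard_normal(t.shape)
x, losses = synthesize(target)
```

Because the Gram matrix averages over time, the optimized waveform is pushed to share the target's activation statistics without copying its temporal structure, which is the property texture synthesis relies on. Reconstructing a specific utterance would need an additional content term that compares activations frame by frame; that term is omitted here.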
