Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency

Mohammad Asif Khan; Fabien Cardinaux; Stefan Uhlich; Marc Ferras; Asja Fischer

arXiv:2005.07810·eess.AS·May 20, 2020

Open Access

TL;DR

This paper introduces a novel adversarial training method that enforces time-frequency spectrogram consistency in unsupervised speech-to-speech conversion, improving the naturalness of generated speech.

Contribution

It proposes a condition to encourage spectrogram consistency during GAN training, enhancing speech quality in cross-domain voice conversion tasks.

Findings

01. Spectrogram consistency improves speech naturalness.

02. The model trained with time-frequency (TF) consistency outperforms the baseline in perceptual quality.

03. The approach is effective for both male-to-female and female-to-male voice conversion.

Abstract

In recent years, generative adversarial network (GAN) based models have been successfully applied to unsupervised speech-to-speech conversion. The rich, compact harmonic view of the magnitude spectrogram is considered a suitable choice for training these models with audio data. To reconstruct the speech signal, the neural network first generates a magnitude spectrogram, which methods like the Griffin-Lim algorithm then use to reconstruct a phase spectrogram. The problem with this procedure is that the generated magnitude spectrogram may not be consistent, and consistency is required to find a phase such that the full spectrogram corresponds to a natural-sounding speech waveform. In this work, we approach this problem by proposing a condition encouraging spectrogram consistency during the adversarial training procedure. We demonstrate our approach on the task of translating the voice of a…
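
The notion of consistency used in the abstract can be illustrated with a small NumPy sketch. A complex spectrogram is consistent when it is the STFT of some waveform, which means that applying the inverse STFT and then the STFT again leaves it unchanged. The code below is not the paper's training condition (the abstract does not spell that out). It only measures this round-trip error and runs a basic Griffin-Lim loop to estimate a phase for a given magnitude. The frame length, hop size, Hann window, test signal, and all function names are assumptions made for illustration.

```python
import numpy as np

N_FFT, HOP = 64, 16                  # assumed frame length and hop (75% overlap)
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window

def stft(x):
    """Complex spectrogram of shape (frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Least-squares inverse STFT: windowed overlap-add divided by the
    summed squared window."""
    length = N_FFT + HOP * (spec.shape[0] - 1)
    signal, norm = np.zeros(length), np.zeros(length)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    for i, frame in enumerate(frames):
        signal[i * HOP:i * HOP + N_FFT] += frame * WINDOW
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    return signal / np.maximum(norm, 1e-8)

def inconsistency(spec):
    """Relative distance between spec and STFT(iSTFT(spec)).
    A value near 0 means the spectrogram is consistent."""
    return np.linalg.norm(spec - stft(istft(spec))) / np.linalg.norm(spec)

def griffin_lim(magnitude, n_iter=50, seed=0):
    """Basic Griffin-Lim: start from random phase, then alternate between
    the consistent set and the set with the target magnitude."""
    rng = np.random.default_rng(seed)
    spec = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        projected = stft(istft(spec))                        # enforce consistency
        spec = magnitude * np.exp(1j * np.angle(projected))  # restore magnitude
    return spec

# Toy signal (an assumption; real speech would be loaded from audio).
t = np.arange(N_FFT + HOP * 39)
x = np.sin(0.2 * t) + 0.5 * np.sin(0.45 * t)
magnitude = np.abs(stft(x))

err_true = inconsistency(stft(x))                           # spectrogram of a real signal
err_random = inconsistency(griffin_lim(magnitude, n_iter=0))  # random phase only
err_gl = inconsistency(griffin_lim(magnitude, n_iter=50))     # after Griffin-Lim
print(err_true, err_random, err_gl)
```

In this sketch the spectrogram of an actual waveform has a round-trip error at the level of floating-point noise, while the same magnitude paired with random phase does not. Griffin-Lim reduces the error only as far as the given magnitude allows, which is why an inconsistent generated magnitude limits the quality of the reconstructed waveform.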

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Griffin-Lim Algorithm