Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency
Mohammad Asif Khan, Fabien Cardinaux, Stefan Uhlich, Marc Ferras, Asja Fischer

TL;DR
This paper introduces a novel adversarial training method that enforces time-frequency spectrogram consistency in unsupervised speech-to-speech conversion, improving the naturalness of generated speech.
Contribution
It proposes a condition to encourage spectrogram consistency during GAN training, enhancing speech quality in cross-domain voice conversion tasks.
Findings
Spectrogram consistency improves speech naturalness.
The model trained with time-frequency (TF) consistency outperforms the baseline in perceptual quality.
Effective in male-to-female and female-to-male voice conversion.
Abstract
In recent years, generative adversarial network (GAN) based models have been successfully applied for unsupervised speech-to-speech conversion. The rich compact harmonic view of the magnitude spectrogram is considered a suitable choice for training these models with audio data. To reconstruct the speech signal, first a magnitude spectrogram is generated by the neural network, which is then utilized by methods like the Griffin-Lim algorithm to reconstruct a phase spectrogram. This procedure bears the problem that the generated magnitude spectrogram may not be consistent, which is required for finding a phase such that the full spectrogram has a natural-sounding speech waveform. In this work, we approach this problem by proposing a condition encouraging spectrogram consistency during the adversarial training procedure. We demonstrate our approach on the task of translating the voice of a…
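The notion of consistency used in the abstract can be illustrated with a small NumPy sketch. It relies on the standard definition (a complex spectrogram S is consistent when STFT(iSTFT(S)) equals S) and on a plain Griffin-Lim loop. This is not the paper's training condition, which the summary here does not spell out. The helper names (`stft`, `istft`, `inconsistency`, `griffin_lim`), the frame sizes, and the white-noise test signal are all assumptions made for illustration.

```python
import numpy as np

N_FFT, HOP = 64, 16                    # illustrative sizes, not taken from the paper
WINDOW = np.hanning(N_FFT + 1)[:-1]    # periodic Hann window

def stft(x):
    """Complex STFT of a 1-D signal, shape (frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S):
    """Least-squares inverse STFT: windowed overlap-add with normalization."""
    frames = np.fft.irfft(S, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (S.shape[0] - 1)
    x, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        x[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    return x / np.maximum(norm, 1e-8)

def inconsistency(S):
    """Relative distance between S and its projection STFT(iSTFT(S))."""
    return np.linalg.norm(S - stft(istft(S))) / np.linalg.norm(S)

def griffin_lim(magnitude, n_iter=50, seed=0):
    """Plain Griffin-Lim: keep the magnitude, re-estimate the phase each pass."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, magnitude.shape)
    for _ in range(n_iter):
        phase = np.angle(stft(istft(magnitude * np.exp(1j * phase))))
    return magnitude * np.exp(1j * phase)

# Hypothetical test signal: white noise standing in for speech.
rng = np.random.default_rng(0)
signal = rng.standard_normal(N_FFT + HOP * 40)
S_true = stft(signal)
mag = np.abs(S_true)

# Same magnitude, arbitrary phase: generally not a valid STFT of any signal.
S_random_phase = mag * np.exp(1j * rng.uniform(-np.pi, np.pi, mag.shape))

err_true = inconsistency(S_true)             # spectrogram of a real signal
err_random = inconsistency(S_random_phase)   # magnitude with arbitrary phase
err_gl = inconsistency(griffin_lim(mag))     # after Griffin-Lim phase recovery
print(err_true, err_random, err_gl)
```

In this sketch the spectrogram of an actual signal has an inconsistency at the level of floating-point error, while the same magnitude paired with an arbitrary phase does not. Griffin-Lim can only reduce that gap by adjusting the phase, which is why a generated magnitude that admits no consistent phase limits the quality of the reconstructed waveform.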
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Griffin-Lim Algorithm
