Improving Stability of LS-GANs for Audio and Speech Signals
Mohammad Esmaeilpour, Raymel Alfonso Sallo, Olivier St-Georges, Patrick Cardinal, Alessandro Lameiras Koerich

TL;DR
This paper introduces a novel similarity metric based on Schur decomposition to improve the stability and quality of LS-GANs in generating audio and speech signals, reducing mode collapse and enhancing sample fidelity.
Contribution
The paper proposes a new metric in unitary space for LS-GANs that improves training stability and sample quality in audio and speech signal generation.
Findings
Enhanced training stability with less mode collapse.
Significant reduction in Fréchet inception distance.
Higher signal-to-noise ratio in reconstructed signals.
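The two evaluation metrics named in the findings have standard textbook definitions. The following is a minimal sketch that assumes NumPy and SciPy. It also assumes that embedding features have already been extracted for real and generated samples, because the feature network the authors used for the Fréchet inception distance is not specified here. Finally, it uses plain global SNR in dB, although the paper may use a different SNR variant. The function names are illustrative.

```python
import numpy as np
from scipy import linalg


def snr_db(reference: np.ndarray, reconstructed: np.ndarray) -> float:
    """Global signal-to-noise ratio in dB; the residual (reference - reconstructed) is treated as noise."""
    noise = reference - reconstructed
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2)))


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets (rows = samples).

    FID = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2}).
    Feature extraction (e.g. by an Inception-style network) is assumed to happen elsewhere.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts from rounding.
    covmean = np.real(linalg.sqrtm(cov_a @ cov_b))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

A lower Fréchet distance means that the generated feature distribution is closer to the real one. A higher SNR means that the reconstructed waveform deviates less from the reference.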
Abstract
In this paper we address the instability issue of generative adversarial networks (GANs) by proposing a new similarity metric in the unitary space of Schur decomposition for 2D representations of audio and speech signals. We show that encoding the departure from normality computed in this vector space into the generator optimization formulation helps to craft more comprehensive spectrograms. We demonstrate the effectiveness of binding this metric for enhancing training stability, with less mode collapse compared to baseline GANs. Experimental results on subsets of the UrbanSound8k and Mozilla Common Voice datasets show considerable improvements in the quality of the generated samples, as measured by the Fréchet inception distance. Moreover, signals reconstructed from these samples achieve a higher signal-to-noise ratio than those from regular LS-GANs.
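The abstract refers to the "departure from normality" computed in the unitary space of the Schur decomposition. The standard quantity by that name (Henrici's) writes a square matrix as A = Z T Z^H, with Z unitary and T upper triangular. It then takes the Frobenius norm of the strictly upper triangular part of T, which is zero exactly when A is normal. The sketch below computes that quantity with SciPy. It assumes square 2D representations, such as spectrograms resized to N x N. It also adds a standard LS-GAN generator term plus a HYPOTHETICAL penalty (`penalized_generator_loss`, weight `lam`) that illustrates one possible way of feeding this quantity into the generator objective. That penalty is not the authors' formulation, which is not given here. The functions are plain NumPy values and are not differentiable training losses.

```python
import numpy as np
from scipy import linalg


def departure_from_normality(a: np.ndarray) -> float:
    """Henrici's departure from normality of a square matrix.

    Compute the complex Schur form A = Z T Z^H and return ||strict_upper(T)||_F.
    Equivalently, dep(A)^2 = ||A||_F^2 - sum_i |lambda_i|^2.
    """
    t, _ = linalg.schur(a, output="complex")
    return float(np.linalg.norm(np.triu(t, k=1), "fro"))


def lsgan_generator_loss(d_fake: np.ndarray, c: float = 1.0) -> float:
    """Least-squares GAN generator loss: 0.5 * E[(D(G(z)) - c)^2]."""
    return 0.5 * float(np.mean((d_fake - c) ** 2))


def penalized_generator_loss(d_fake, real_specs, fake_specs, lam=0.1) -> float:
    """HYPOTHETICAL: LS-GAN loss plus lam * mean |dep(real) - dep(fake)| over paired spectrograms."""
    gaps = [
        abs(departure_from_normality(r) - departure_from_normality(f))
        for r, f in zip(real_specs, fake_specs)
    ]
    return lsgan_generator_loss(d_fake) + lam * float(np.mean(gaps))
```

For example, a symmetric (hence normal) matrix has a departure of about 0. The nilpotent matrix [[0, 1], [0, 0]] has a departure of 1.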
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
