Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion
Samuel J. Broughton, Md Asif Jalal, Roger K. Moore

TL;DR
This paper investigates the interpretability of non-parallel GANs in voice conversion, revealing that layer depth significantly influences output quality and that learned representations remain highly similar across datasets.
Contribution
It provides new insights into how deep generative networks' representations evolve and highlights the importance of layer count in voice conversion GANs.
Findings
The number of repeating layers, more than their learned parameters, drives voice conversion quality.
Learned representations stay similar across datasets.
Layer representations remain close to initial parameters.
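One way to make the last finding concrete is to compare each layer's parameters before and after training. The sketch below is a minimal illustration, not the authors' method: the summary does not state which similarity measure the paper uses, so the relative L2 distance, the function names `relative_drift` and `drift_per_layer`, the layer names, and the toy weights are all assumptions made for this example.

```python
import numpy as np


def relative_drift(initial, trained):
    """Relative L2 distance ||trained - initial|| / ||initial|| for one layer.

    This metric is an assumption for illustration; the paper's own
    measure is not given in this summary.
    """
    initial = np.asarray(initial, dtype=float)
    trained = np.asarray(trained, dtype=float)
    return float(np.linalg.norm(trained - initial) / np.linalg.norm(initial))


def drift_per_layer(initial_params, trained_params):
    """Apply relative_drift to every layer shared by two parameter dicts."""
    return {
        name: relative_drift(initial_params[name], trained_params[name])
        for name in initial_params
    }


# Hypothetical toy data standing in for three repeating blocks of a generator.
rng = np.random.default_rng(0)
initial_params = {f"block_{i}": rng.normal(size=(4, 4)) for i in range(3)}

# Simulate "training" that barely moves the weights away from initialisation.
trained_params = {
    name: w + 0.01 * rng.normal(size=w.shape)
    for name, w in initial_params.items()
}

drift = drift_per_layer(initial_params, trained_params)
for name, value in drift.items():
    print(f"{name}: relative drift = {value:.4f}")
```

A drift near zero means a layer's parameters stayed close to their initial values; a value of 1.0 means the change is as large as the initial weights themselves. With a real model, the two dictionaries would be filled from a snapshot taken at initialisation and one taken after training.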
Abstract
Generative Adversarial Networks (GANs) are machine learning networks based around creating synthetic data. Voice Conversion (VC) is a subset of voice translation that involves translating the paralinguistic features of a source speaker to a target speaker while preserving the linguistic information. The aim of non-parallel conditional GANs for VC is to translate an acoustic speech feature sequence from one domain to another without the use of paired data. In the study reported here, we investigated the interpretability of state-of-the-art implementations of non-parallel GANs in the domain of VC. We show that the learned representations in the repeating layers of a particular GAN architecture remain close to their original randomly initialised parameters, demonstrating that it is the number of repeating layers that is more responsible for the quality of the output. We also analysed the…
