Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning
Qian Yang, Calbert Graham

TL;DR
The paper introduces the Stepback network, a deep learning model that improves voice conversion by enhancing disentanglement and content preservation using non-parallel data, reducing training costs and increasing quality.
Contribution
It proposes a novel multi-task learning framework with self-destructive constraints for better disentanglement in voice conversion without parallel data.
Findings
Significant improvement in voice conversion quality.
Reduced training costs compared to traditional methods.
Effective disentanglement of speaker identity and content.
Abstract
Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to enhance disentanglement completion and linguistic content preservation. The Stepback network incorporates a dual flow of different domain data inputs and uses constraints with self-destructive amendments to optimize the content encoder. Extensive experiments show that our model significantly improves VC performance, reducing training costs while achieving high-quality voice conversion. The Stepback network's design offers a promising solution for advanced voice conversion tasks.
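The abstract does not give architectural details, so the following is only a generic sketch of encoder-decoder disentanglement with a two-term multi-task loss, not the Stepback network itself. All names, dimensions, and the linear "networks" (`encode_content`, `encode_speaker`, `decode`, `multitask_loss`) are hypothetical, and the paper's self-destructive constraints are not modeled. It shows how conversion can work without parallel data under these assumptions: the content code comes from the source utterance, the speaker code from any utterance by the target speaker, and the two utterances need not share length or text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 80 mel bins, small content and speaker codes (illustrative only).
N_MELS, CONTENT_DIM, SPEAKER_DIM = 80, 16, 8

# Hypothetical linear maps standing in for trained neural networks.
W_content = rng.normal(scale=0.1, size=(N_MELS, CONTENT_DIM))
W_speaker = rng.normal(scale=0.1, size=(N_MELS, SPEAKER_DIM))
W_decoder = rng.normal(scale=0.1, size=(CONTENT_DIM + SPEAKER_DIM, N_MELS))


def encode_content(mel):
    """Per-frame content code: (frames, N_MELS) -> (frames, CONTENT_DIM)."""
    return mel @ W_content


def encode_speaker(mel):
    """Utterance-level speaker code: mean over frames -> (SPEAKER_DIM,)."""
    return (mel @ W_speaker).mean(axis=0)


def decode(content, speaker):
    """Rebuild a spectrogram from per-frame content plus one speaker code."""
    speaker_tiled = np.tile(speaker, (content.shape[0], 1))
    return np.concatenate([content, speaker_tiled], axis=1) @ W_decoder


def convert(source_mel, target_mel):
    """Keep the source's content, swap in the target's speaker code."""
    return decode(encode_content(source_mel), encode_speaker(target_mel))


def multitask_loss(mel_a, mel_b, weight=1.0):
    """Two illustrative tasks: self-reconstruction and content consistency."""
    # Task 1: an utterance decoded with its own codes should reproduce itself.
    recon = decode(encode_content(mel_a), encode_speaker(mel_a))
    recon_loss = np.mean((recon - mel_a) ** 2)
    # Task 2: re-encoding the converted output should give the source content.
    converted = convert(mel_a, mel_b)
    content_loss = np.mean((encode_content(converted) - encode_content(mel_a)) ** 2)
    return recon_loss + weight * content_loss


# Non-parallel inputs: different lengths, no shared text required.
source = rng.normal(size=(50, N_MELS))
target = rng.normal(size=(70, N_MELS))
converted = convert(source, target)   # has the source's frame count
loss = multitask_loss(source, target)
```

In a real system the linear maps would be trained networks and the loss would be minimized by gradient descent; the sketch only fixes the data flow and the shape of the objective.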
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
