Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning

Qian Yang; Calbert Graham

arXiv:2501.15613·cs.SD·January 28, 2025

Open Access

TL;DR

The paper introduces the Stepback network, a deep learning model that improves voice conversion by enhancing disentanglement and content preservation using non-parallel data, reducing training costs and increasing quality.

Contribution

It proposes a novel multi-task learning framework with self-destructive constraints for better disentanglement in voice conversion without parallel data.

Findings

01 Significant improvement in voice conversion quality.

02 Reduced training costs compared to traditional methods.

03 Effective disentanglement of speaker identity and content.

Abstract

Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to disentangle speaker identity from linguistic content more completely and to better preserve that content. The Stepback network incorporates a dual flow of inputs from different data domains and uses constraints with self-destructive amendments to optimize the content encoder. Extensive experiments show that our model significantly improves VC performance, reducing training costs while achieving high-quality voice conversion. The Stepback network's design offers a promising solution for advanced voice conversion tasks.
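
To make the idea of separating speaker identity from content concrete, the following is a toy sketch only. It is not the Stepback network and does not reproduce its dual-flow inputs or self-destructive constraints, which the abstract does not specify. It assumes a much simpler, well-known trick: treating the per-utterance mean of each feature dimension as a stand-in for the speaker and the mean-removed frames as a stand-in for content. The function names `speaker_code`, `content_code`, and `convert`, and the two-dimensional example frames, are hypothetical and chosen for illustration.

```python
from statistics import fmean


def speaker_code(frames):
    """Per-dimension mean over time: a crude stand-in for a speaker embedding (assumption)."""
    dims = len(frames[0])
    return [fmean(frame[d] for frame in frames) for d in range(dims)]


def content_code(frames):
    """Subtract the per-utterance mean so only time-varying, content-like structure remains."""
    mu = speaker_code(frames)
    return [[x - m for x, m in zip(frame, mu)] for frame in frames]


def convert(source_frames, target_frames):
    """Recombine the source utterance's content code with the target utterance's speaker code.

    No parallel data is needed: the two utterances may have different
    lengths and need not contain the same words.
    """
    content = content_code(source_frames)
    spk = speaker_code(target_frames)
    return [[c + s for c, s in zip(frame, spk)] for frame in content]


# Toy 2-dimensional feature frames (hypothetical values, not real speech).
source = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
target = [[10.0, 0.0], [12.0, 0.0], [14.0, 0.0]]
converted = convert(source, target)
```

In this toy setting the converted utterance carries the target's speaker code and the source's content code exactly. A learned model has no such guarantee, which is why constraints that push the content encoder to discard speaker information matter in practice.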

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing