Preserving background sound in noise-robust voice conversion via multi-task learning
Jixun Yao, Yi Lei, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie, Hai Li, Junhui Liu, Danming Xie

TL;DR
This paper introduces an end-to-end multi-task learning framework for voice conversion that preserves background sound, explicitly considers phase information, and reduces distortion caused by source separation, improving performance in noisy scenarios.
Contribution
The proposed framework preserves background sound in voice conversion by integrating source separation and VC through joint training, whereas prior work focused mainly on clean voices.
Findings
Significantly outperforms baseline systems in preserving background sound.
Achieves comparable quality and speaker similarity to models trained on clean data.
Reduces distortion caused by imperfect source separation.
Abstract
Background sound is an informative form of art that is helpful in providing a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research on VC, focusing mainly on clean voices, pays little attention to VC with background sound. The critical problem for preserving background sound in VC is the inevitable speech distortion introduced by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework via multi-task learning which sequentially cascades a source separation (SS) module, a bottleneck feature extraction module and a VC module. Specifically, the source separation task explicitly considers critical phase information and confines the distortion caused by the imperfect separation process. The source separation task, the typical VC task and the unified task…
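The cascade and joint training described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The three callables, the L1 losses, the loss weights, and the step that adds the separated background back onto the converted speech are all assumptions made for illustration. The abstract is truncated before it says what the unified task computes, so the end-to-end term below is only a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Signal = List[float]  # toy stand-in for a waveform or feature sequence


def l1(a: Signal, b: Signal) -> float:
    """Mean absolute error between two equal-length signals."""
    assert len(a) == len(b), "signals must have the same length"
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


@dataclass
class Cascade:
    """Hypothetical SS -> bottleneck feature extraction -> VC pipeline."""
    separate: Callable[[Signal], Tuple[Signal, Signal]]  # mixture -> (speech, background)
    extract: Callable[[Signal], Signal]                  # speech -> bottleneck features
    convert: Callable[[Signal], Signal]                  # features -> converted speech

    def forward(self, mixture: Signal):
        speech, background = self.separate(mixture)
        converted = self.convert(self.extract(speech))
        # Assumption: the background is preserved by adding it back sample-wise.
        output = [c + b for c, b in zip(converted, background)]
        return speech, background, converted, output


def multitask_loss(model: Cascade, mixture: Signal, clean_speech: Signal,
                   clean_bg: Signal, target_speech: Signal,
                   weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of three task losses (weights are illustrative, not from the paper)."""
    speech, background, converted, output = model.forward(mixture)
    ss_loss = l1(speech, clean_speech) + l1(background, clean_bg)  # separation task
    vc_loss = l1(converted, target_speech)                         # conversion task
    target_mix = [t + b for t, b in zip(target_speech, clean_bg)]
    unified_loss = l1(output, target_mix)                          # placeholder end-to-end term
    w_ss, w_vc, w_uni = weights
    return w_ss * ss_loss + w_vc * vc_loss + w_uni * unified_loss
```

Because all three terms are computed from a single forward pass, a gradient step on the summed loss would update the separation and conversion modules together. That is the general idea behind reducing cascade mismatch through joint training; the actual objectives used in the paper may differ.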
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
