Noro: Noise-Robust One-shot Voice Conversion with Hidden Speaker Representation Learning

Haorui He; Yuchen Song; Yuancheng Wang; Haoyang Li; Xueyao Zhang; Li Wang; Gongping Huang; Eng Siong Chng; and Zhizheng Wu

arXiv:2411.19770 · cs.SD · August 29, 2025

TL;DR

Noro is a noise-robust one-shot voice conversion (VC) system designed for noisy reference speech. It combines a dual-branch reference encoding module with a noise-agnostic contrastive speaker loss. The paper also shows that the baseline system's reference encoder can be repurposed as a speaker encoder.

Contribution

The paper introduces Noro, a one-shot VC system with components built for noisy reference speech, demonstrates its effectiveness against a baseline system, and investigates the hidden speaker representations learned by that baseline.

Findings

01

Noro outperforms the baseline system in both clean and noisy scenarios.

02

Repurposed as a speaker encoder, the baseline system's reference encoder is competitive with several advanced self-supervised learning models.

03

The results support using Noro in real-world voice conversion, where reference speech sourced from the internet often contains background noise.

Abstract

The effectiveness of one-shot voice conversion (VC) decreases in real-world scenarios where reference speeches, which are often sourced from the internet, contain various disturbances like background noise. To address this issue, we introduce Noro, a noise-robust one-shot VC system. Noro features innovative components tailored for VC using noisy reference speeches, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. Experimental results demonstrate that Noro outperforms our baseline system in both clean and noisy scenarios, highlighting its efficacy for real-world applications. Additionally, we investigate the hidden speaker representation capabilities of our baseline system by repurposing its reference encoder as a speaker encoder. The results show that it is competitive with several advanced self-supervised learning models for speaker…
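The abstract names a noise-agnostic contrastive speaker loss but does not give its formulation here. As a rough illustration only, the sketch below uses a generic InfoNCE-style contrastive loss: the embedding of a clean reference and the embedding of a noisy version of the same reference form a positive pair, and other speakers in the batch act as negatives. The function names, the cosine similarity, the temperature value, and the assumption that each batch index is a different speaker are all hypothetical choices, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_speaker_loss(clean_embs, noisy_embs, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's definition).

    Assumption: clean_embs[i] and noisy_embs[i] are speaker embeddings of
    the same speaker (clean vs. noise-corrupted reference), and every
    index i in the batch is a different speaker.
    """
    n = len(clean_embs)
    total = 0.0
    for i in range(n):
        # Similarity of clean embedding i to every noisy embedding.
        logits = [
            cosine(clean_embs[i], noisy_embs[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching noisy embedding as the target.
        total += log_denom - logits[i]
    return total / n


# Toy two-speaker batch with 2-D embeddings (hypothetical values).
clean = [[1.0, 0.0], [0.0, 1.0]]
noisy_aligned = [[0.9, 0.1], [0.1, 0.9]]   # noise barely moves the embedding
noisy_swapped = [[0.1, 0.9], [0.9, 0.1]]   # noisy embedding matches the wrong speaker

loss_aligned = contrastive_speaker_loss(clean, noisy_aligned)
loss_swapped = contrastive_speaker_loss(clean, noisy_swapped)
print(loss_aligned < loss_swapped)
```

Under these assumptions the loss is small when the noisy embedding stays close to its clean counterpart and large when it drifts toward another speaker, which is the general behavior a noise-agnostic speaker objective would aim for.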

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques