OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion

Zhichao Wang; Tao Li; Wenshuo Ge; Zihao Cui; Shilei Zhang; Junlan Feng

arXiv:2601.18094 · eess.AS · May 22, 2026

TL;DR

OneVoice is a unified zero-shot voice conversion model that handles linguistic-preserving, expressive, and singing scenarios within a single framework, with high fidelity and flexible scenario control.

Contribution

The paper introduces a novel unified model using a Mixture-of-Experts architecture and a two-stage training process for multiple voice conversion scenarios.

Findings

01

Matches or surpasses specialized models in all three scenarios.

02

Enables flexible scenario control and fast decoding with as few as 2 steps.

03

Effective handling of diverse voice conversion tasks with a single model.

Abstract

Recent progress in voice conversion (VC) has reached a new milestone in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification lies in a Mixture-of-Experts (MoE) designed to explicitly model shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism, including shared expert isolation and scenario-aware domain expert assignment with global-local cues. For precise conditioning, scenario-specific prosodic features…
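The dual-path routing idea in the abstract can be illustrated with a toy layer. This is a minimal sketch, not the authors' implementation. It assumes that "shared expert isolation" means one expert is always applied and bypasses the router, and that "global-local cues" means the router sees a per-frame feature (local) concatenated with a scenario embedding (global). The class name `ToyMoE`, the parameter names, the linear experts, and the top-k gating are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product: weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMoE:
    """Hypothetical MoE layer: one always-on shared expert plus routed domain experts."""

    def __init__(self, dim, num_domain_experts, num_scenarios, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Path 1 (assumption): the shared expert is never routed, it always runs.
        self.shared = make_matrix(dim, dim, rng)
        # Path 2 (assumption): domain experts are chosen by a learned router.
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_domain_experts)]
        # Global cue: one embedding per scenario id (illustrative only).
        self.scenario_emb = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_scenarios)
        ]
        # The router consumes [local frame feature ; global scenario embedding].
        self.router = make_matrix(num_domain_experts, 2 * dim, rng)

    def route(self, x, scenario_id):
        cue = x + self.scenario_emb[scenario_id]  # list concatenation, length 2 * dim
        probs = softmax(linear(self.router, cue))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in top)
        # Renormalize the gates over the selected experts only.
        return {i: probs[i] / norm for i in top}

    def forward(self, x, scenario_id):
        out = linear(self.shared, x)  # shared conversion knowledge
        gates = self.route(x, scenario_id)
        for i, gate in gates.items():  # scenario-specific contribution
            y = linear(self.experts[i], x)
            out = [o + gate * yi for o, yi in zip(out, y)]
        return out, gates


moe = ToyMoE(dim=4, num_domain_experts=3, num_scenarios=3, top_k=2)
output, gates = moe.forward([0.1, 0.2, 0.3, 0.4], scenario_id=1)
```

Here `gates` maps each selected domain expert index to its renormalized weight, while the shared expert contributes to `output` regardless of the scenario. A real system would use learned feed-forward experts inside a transformer block; the plain lists above only make the control flow visible.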

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Topic Modeling