OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion
Zhichao Wang, Tao Li, Wenshuo Ge, Zihao Cui, Shilei Zhang, Junlan Feng

TL;DR
OneVoice is a unified zero-shot voice conversion model that handles linguistic-preserving, expressive, and singing scenarios within a single framework, with high fidelity and flexible scenario control.
Contribution
The paper introduces a novel unified model using a Mixture-of-Experts architecture and a two-stage training process for multiple voice conversion scenarios.
Findings
Matches or surpasses specialized models in all three scenarios.
Enables flexible scenario control and fast decoding with as few as 2 steps.
Handles diverse voice conversion tasks effectively with a single model.
Abstract
Recent progress of voice conversion (VC) has achieved a new milestone in speaker cloning and linguistic preservation. But the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification lies in a Mixture-of-Experts (MoE) designed to explicitly model shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism, including shared expert isolation and scenario-aware domain expert assignment with global-local cues. For precise conditioning, scenario-specific prosodic features…
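The dual-path routing described in the abstract can be illustrated with a toy sketch. The abstract does not give the actual layer design, so everything here is an assumption made for illustration: the class name `ToyMoE`, the use of plain linear maps as experts, a per-scenario bias standing in for the "global" cue, a per-frame router projection standing in for the "local" cue, and top-k gating. It shows only the general idea of an always-on shared expert combined with scenario-aware selection among domain experts, not the authors' implementation.

```python
import math
import random


def linear(weights, x):
    """Apply a dense layer stored as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyMoE:
    """Hypothetical MoE layer: one shared expert plus routed domain experts."""

    def __init__(self, dim, num_domain_experts, num_scenarios, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Shared path: applied to every frame, regardless of scenario.
        self.shared = random_matrix(rng, dim, dim)
        # Domain path: a pool of experts chosen by the router.
        self.experts = [random_matrix(rng, dim, dim)
                        for _ in range(num_domain_experts)]
        # Local cue (assumed): router logits computed from the frame itself.
        self.local_router = random_matrix(rng, num_domain_experts, dim)
        # Global cue (assumed): one logit bias vector per scenario.
        self.global_bias = random_matrix(rng, num_scenarios, num_domain_experts)

    def route(self, frame, scenario_id):
        """Return {expert_index: gate} for the top-k domain experts."""
        local_logits = linear(self.local_router, frame)
        global_logits = self.global_bias[scenario_id]
        probs = softmax([l + g for l, g in zip(local_logits, global_logits)])
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        return {i: probs[i] / norm for i in chosen}

    def forward(self, frame, scenario_id):
        """Sum the shared expert output and the gated domain expert outputs."""
        out = linear(self.shared, frame)
        gates = self.route(frame, scenario_id)
        for index, gate in gates.items():
            expert_out = linear(self.experts[index], frame)
            out = [o + gate * e for o, e in zip(out, expert_out)]
        return out, gates


# Usage: route a single 4-dimensional frame under scenario 2 (for example, singing).
moe = ToyMoE(dim=4, num_domain_experts=3, num_scenarios=3, top_k=2)
out, gates = moe.forward([0.1, -0.2, 0.3, 0.05], scenario_id=2)
```

Keeping the shared expert outside the router means knowledge common to all scenarios never competes for gate probability, while the scenario bias lets the same frame be sent to different domain experts depending on the requested scenario.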
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Topic Modeling
