Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning
Junchuan Zhao, Xintong Wang, Ye Wang

TL;DR
This paper introduces a novel voice conversion model that leverages in-context learning and a prosody-aware codec to improve prosody control, speaker adaptation, and naturalness in speech synthesis.
Contribution
It presents a new VC framework that integrates a prosody-aware audio codec encoder (PACE) for finer prosody manipulation, and reports better prosody preservation, timbre consistency, and naturalness than baseline VC systems.
Findings
Outperforms baselines in prosody preservation
Maintains speaker timbre effectively
Enhances naturalness of synthesized speech
Abstract
Recent advances in discrete audio codecs have significantly improved speech representation modeling, while codec language models have enabled in-context learning for zero-shot speech synthesis. Inspired by this, we propose a voice conversion (VC) model within the VALLE-X framework, leveraging its strong in-context learning capabilities for speaker adaptation. To enhance prosody control, we introduce a prosody-aware audio codec encoder (PACE) module, which isolates and refines prosody from other sources, improving expressiveness and control. By integrating PACE into our VC model, we achieve greater flexibility in prosody manipulation while preserving speaker timbre. Experimental results demonstrate that our approach outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
