Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning

Junchuan Zhao; Xintong Wang; Ye Wang

arXiv:2505.15402·cs.SD·September 30, 2025

Open Access

TL;DR

This paper introduces a zero-shot voice conversion model that combines the in-context learning of a codec language model with a prosody-aware audio codec encoder to improve prosody control, speaker adaptation, and the naturalness of converted speech.

Contribution

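It presents a voice conversion (VC) framework that integrates a prosody-aware audio codec encoder (PACE) for finer prosody manipulation, and reports that it outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.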

Findings

01 Outperforms baselines in prosody preservation

02 Maintains speaker timbre effectively

03 Enhances naturalness of synthesized speech

Abstract

Recent advances in discrete audio codecs have significantly improved speech representation modeling, while codec language models have enabled in-context learning for zero-shot speech synthesis. Inspired by this, we propose a voice conversion (VC) model within the VALLE-X framework, leveraging its strong in-context learning capabilities for speaker adaptation. To enhance prosody control, we introduce a prosody-aware audio codec encoder (PACE) module, which isolates prosody from other sources of variation and refines it, improving expressiveness and control. By integrating PACE into our VC model, we achieve greater flexibility in prosody manipulation while preserving speaker timbre. Experimental results demonstrate that our approach outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing