Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision
Zhijun Jia, Huaying Xue, Xiulian Peng, Yan Lu

TL;DR
This paper introduces a two-stage, zero-shot accent conversion framework that performs the conversion on semantic tokens and needs minimal supervision: as little as 15 minutes of weakly parallel data is enough for high-quality accent conversion.
Contribution
The proposed 'convert-and-speak' framework decouples accent conversion and speech synthesis, reducing data requirements and leveraging language pre-training for effective zero-shot accent conversion.
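The decoupling described above can be pictured as two independently trained stages joined by a semantic-token interface. The following is a minimal structural sketch, not the authors' implementation: the class name ConvertAndSpeak, the three callables, and the toy stand-in functions are all hypothetical and exist only to show how the stages connect.

```python
from dataclasses import dataclass
from typing import Callable, List

SemanticTokens = List[int]   # discrete units extracted from speech
Waveform = List[float]       # placeholder for audio samples


@dataclass
class ConvertAndSpeak:
    """Hypothetical two-stage pipeline: convert tokens, then synthesize."""
    tokenize: Callable[[Waveform], SemanticTokens]        # speech -> source-accent tokens
    convert: Callable[[SemanticTokens], SemanticTokens]   # stage 1: the only part that needs parallel data
    speak: Callable[[SemanticTokens], Waveform]           # stage 2: trainable on target-accent speech alone

    def __call__(self, source_speech: Waveform) -> Waveform:
        source_tokens = self.tokenize(source_speech)
        target_tokens = self.convert(source_tokens)
        return self.speak(target_tokens)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_tokenize(wave: Waveform) -> SemanticTokens:
    # Quantize samples in [0, 1) into 8 bins.
    return [min(7, max(0, int(x * 8))) for x in wave]


def toy_convert(tokens: SemanticTokens) -> SemanticTokens:
    # Pretend accent conversion: a fixed remapping of token ids.
    return [(t + 1) % 8 for t in tokens]


def toy_speak(tokens: SemanticTokens) -> Waveform:
    # Pretend synthesis: map each token back to a sample value.
    return [t / 8 for t in tokens]


pipeline = ConvertAndSpeak(toy_tokenize, toy_convert, toy_speak)
converted = pipeline([0.0, 0.25, 0.5, 0.875])
```

Because the stages only share the token interface, either one can be replaced or retrained without touching the other, which is the property the framework relies on to cut the parallel-data requirement.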
Findings
Achieves state-of-the-art accent similarity and speech quality.
Operates effectively with only 15 minutes of weakly parallel data.
Demonstrates high adaptability across diverse accents.
Abstract
The scarcity of parallel data is the key challenge in accent conversion (AC), in which both the pronunciation units and the prosody pattern need to be converted. We propose a two-stage generative framework, "convert-and-speak", in which conversion operates only at the semantic-token level and speech is synthesized, conditioned on the converted semantic tokens, by a speech generative model in the target-accent domain. The decoupled design lets the "speaking" module use massive amounts of target-accent speech and reduces the parallel data required for the "conversion" module. Using semantic tokens as the bridge also removes the need for data with text transcriptions and makes it possible to use language pre-training to further reduce the need for parallel accent speech data. To reduce the complexity and latency of "speaking", a…
