Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones
Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

TL;DR
This study investigates how accent differences between standard and accented Chinese speech affect voice cloning quality and perception, revealing that accent influences perceived identity and intelligibility.
Contribution
It demonstrates that accent variation impacts voice clone perception and suggests evaluating speaker identity and accent preservation separately.
Findings
Clones are rated more similar to originals for standard speech.
Intelligibility improves from original to clone, especially for accented speech.
Accent variation influences perceived identity and intelligibility in voice cloning.
Abstract
Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
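The embedding-based comparison in the abstract can be illustrated with a minimal sketch. The summary does not name the embedding model, the distance metric, or the statistical test, so everything below is an assumption: each original and its clone are taken to be fixed-length speaker-embedding vectors, cosine distance stands in for "original-clone distance", and a two-sided permutation test on group means stands in for checking whether accented and standard pairs differ reliably. The function names are hypothetical and are not the authors' code.

```python
import math
import random
from statistics import mean


def cosine_distance(a, b):
    """Assumed metric: 1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def pair_distances(pairs):
    """Original-clone distance for each (original_embedding, clone_embedding) pair."""
    return [cosine_distance(orig, clone) for orig, clone in pairs]


def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed mean difference, p-value). This is an illustrative
    choice of test, not necessarily the one used in the study.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the accented/standard labels
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical toy embeddings, for illustration only (not data from the study).
    accented = pair_distances([([0.9, 0.1, 0.2], [0.8, 0.2, 0.3]),
                               ([0.1, 0.9, 0.4], [0.2, 0.8, 0.5])])
    standard = pair_distances([([0.5, 0.5, 0.1], [0.4, 0.6, 0.1]),
                               ([0.3, 0.2, 0.9], [0.3, 0.3, 0.8])])
    diff, p = permutation_test(accented, standard, n_perm=2000)
    print(f"mean distance difference = {diff:.4f}, p = {p:.3f}")
```

A large p-value in such a test would correspond to "no reliable accented-standard difference" in the embedding space. The abstract's point is that the perceptual ratings can still differ, which is why a distance of this kind is not treated as a substitute for listener judgments of identity match and intelligibility.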
