Pairwise Evaluation of Accent Similarity in Speech Synthesis
Jinzuomu Zhong, Suyuan Liu, Dan Wells, Korin Richmond

TL;DR
This paper develops improved subjective and objective evaluation methods for accent similarity in speech synthesis, including a refined listening test and pronunciation metrics, revealing limitations of common metrics like WER.
Contribution
It introduces a new accent similarity evaluation framework combining enhanced listening tests and pronunciation metrics, addressing gaps in current assessment methods.
Findings
Refined XAB listening test achieves higher statistical significance with fewer listeners and lower cost
Pronunciation metrics based on vowel formants and phonetic posteriorgrams are effective for evaluating accent generation
Common metrics like WER have significant limitations for underrepresented accents
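To make "higher statistical significance with fewer listeners" concrete, the sketch below shows one common way to test whether XAB preference votes differ from chance. It is an illustration only: the summary does not say which statistical test the authors used, so the exact two-sided binomial test and the function name `binomial_two_sided_p` are assumptions.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value (assumed test, not taken from the paper).

    Sums the probability of every outcome that is no more likely than the
    observed count under the null hypothesis of preference rate p0.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    # Probability mass of each possible vote count under the null.
    pmf = [comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance so ties are not lost to rounding.
    tail = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical data: 70 of 100 XAB trials preferred system A over system B.
p_value = binomial_two_sided_p(70, 100)
print(f"p = {p_value:.2e}")
```

With a fixed preference rate, the p-value shrinks as the number of trials grows, which is why a test design that yields cleaner listener judgements can reach significance with fewer listeners.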
Abstract
Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used. Moreover, our findings underscore significant…
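As a rough illustration of the two pronunciation-related metrics named in the abstract, the sketch below computes a distance between vowel formants and a distance between phonetic posteriorgrams (PPGs). The abstract does not give the exact definitions, so everything here is an assumption: Euclidean distance over mean (F1, F2) values in Hz, Jensen-Shannon divergence between frames, PPGs that are already time-aligned, and the function names themselves.

```python
from math import log, sqrt


def formant_distance(vowels_a: dict, vowels_b: dict) -> float:
    """Mean Euclidean distance between (F1, F2) pairs, in Hz, over shared vowels.

    Each argument maps a vowel label to its mean (F1, F2). Assumed definition.
    """
    shared = sorted(set(vowels_a) & set(vowels_b))
    if not shared:
        raise ValueError("no vowels in common")
    total = 0.0
    for vowel in shared:
        f1_a, f2_a = vowels_a[vowel]
        f1_b, f2_b = vowels_b[vowel]
        total += sqrt((f1_a - f1_b) ** 2 + (f2_a - f2_b) ** 2)
    return total / len(shared)


def js_divergence(p: list, q: list) -> float:
    """Jensen-Shannon divergence (natural log) between two distributions."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum.
        return sum(ai * log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def ppg_distance(ppg_a: list, ppg_b: list) -> float:
    """Mean frame-wise JS divergence between two time-aligned PPGs.

    Each PPG is a list of frames; each frame is a probability distribution
    over phone classes. Alignment (for example by DTW) is assumed done.
    """
    if not ppg_a or len(ppg_a) != len(ppg_b):
        raise ValueError("PPGs must be non-empty and of equal length")
    return sum(js_divergence(p, q) for p, q in zip(ppg_a, ppg_b)) / len(ppg_a)


# Hypothetical formant means for a reference and a synthesised speaker.
reference = {"i": (300.0, 2300.0), "a": (750.0, 1300.0)}
synthesised = {"i": (303.0, 2304.0), "a": (750.0, 1300.0)}
print(formant_distance(reference, synthesised))
```

Smaller values on either measure would indicate pronunciation closer to the reference accent. In practice formants are often normalised per speaker before comparison, which this sketch omits.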
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
