Pairwise Evaluation of Accent Similarity in Speech Synthesis

Jinzuomu Zhong; Suyuan Liu; Dan Wells; Korin Richmond

arXiv:2505.14410 · eess.AS · February 6, 2026

Open Access

TL;DR

This paper develops improved subjective and objective evaluation methods for accent similarity in speech synthesis, including a refined listening test and pronunciation metrics, revealing limitations of common metrics like WER.

Contribution

The paper introduces an accent similarity evaluation framework that combines a refined XAB listening test with pronunciation-related metrics, addressing gaps in current assessment methods.

Findings

01. The refined XAB listening test achieves higher statistical significance with fewer listeners and lower cost.

02. Pronunciation metrics based on distances between vowel formants and phonetic posteriorgrams are effective for evaluating accent generation.

03. Common metrics like WER have significant limitations for underrepresented accents.
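
One way to make the "higher significance" in the first finding concrete is an exact binomial test on XAB choices. The sketch below is an illustrative assumption, not the authors' analysis: this summary does not say which statistical test the paper uses, and the function name `xab_binomial_p_value` and the 50% chance level are hypothetical choices. Each trial is treated as a listener picking whichever of A or B sounds closer in accent to the reference X.

```python
from math import comb


def xab_binomial_p_value(votes_for_a: int, total_votes: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for an XAB preference count.

    Hypothetical helper: tests whether the share of "A is closer to X"
    votes differs from the chance level. The p-value is the total
    probability of all outcomes that are no more likely than the
    observed count under the null hypothesis.
    """
    if total_votes <= 0 or not 0 <= votes_for_a <= total_votes:
        raise ValueError("need 0 <= votes_for_a <= total_votes and total_votes > 0")

    # Probability of every possible vote count under the null hypothesis.
    pmf = [
        comb(total_votes, k) * chance**k * (1.0 - chance) ** (total_votes - k)
        for k in range(total_votes + 1)
    ]
    observed = pmf[votes_for_a]

    # Small relative tolerance so ties are not lost to rounding error.
    p_value = sum(p for p in pmf if p <= observed * (1.0 + 1e-9))
    return min(1.0, p_value)
```

Under these assumptions, 8 of 10 votes for system A gives p ≈ 0.109, which is not significant at the 0.05 level, while 10 of 10 gives p ≈ 0.002. This shows why the number of usable responses per comparison matters for the cost of a listening test.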

Abstract

Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used. Moreover, our findings underscore significant…
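
The objective metrics named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact distance definitions, so the code assumes a plain Euclidean distance between formant vectors such as (F1, F2) in Hz and the commonly used Mel Cepstral Distortion formula over frames that are already time-aligned. The function names and input layout are hypothetical.

```python
import math
from typing import Sequence


def formant_distance(ref: Sequence[float], syn: Sequence[float]) -> float:
    """Euclidean distance between two formant vectors, e.g. (F1, F2) in Hz.

    Assumption: both vectors describe the same vowel, one measured in the
    reference speech and one in the synthesised speech.
    """
    if len(ref) != len(syn):
        raise ValueError("formant vectors must have the same length")
    return math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))


def mel_cepstral_distortion(
    ref_frames: Sequence[Sequence[float]],
    syn_frames: Sequence[Sequence[float]],
) -> float:
    """Mean Mel Cepstral Distortion in dB over time-aligned frames.

    Assumptions: frames are already aligned (for example by dynamic time
    warping done elsewhere), and index 0 of each frame is the energy
    coefficient, which is excluded from the distance.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need the same non-zero number of aligned frames")

    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    per_frame = []
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("aligned frames must have the same dimension")
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        per_frame.append(scale * math.sqrt(squared))
    return sum(per_frame) / len(per_frame)
```

A distance between phonetic posteriorgrams could follow the same frame-wise pattern, with each frame holding phone posterior probabilities instead of cepstral coefficients. The choice of alignment method and distance function is left open here because this summary does not specify them.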

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems