PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech
Venkata Pushpak Teja Menta

TL;DR
This paper introduces PSP (Phoneme Substitution Profile), an interpretable benchmark that scores accent in Indic TTS systems along six dimensions and shows that no single system leads on all of them.
Contribution
It proposes a novel per-phonological-dimension accent benchmark for Indic TTS, decomposing accent into six interpretable metrics and benchmarking multiple systems.
Findings
Retroflex collapse increases with phonological difficulty: Hindi < Telugu < Tamil.
PSP ranks systems in a different order than WER does, so intelligibility scores alone do not capture accent fidelity.
No single system excels across all six accent dimensions.
Abstract
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9…
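The abstract describes the first four dimensions as native-speaker-centroid probes applied to forced-aligned segments. A minimal sketch of one such probe, for the retroflex collapse rate, might look like the following. The decision rule (a retroflex target counts as "collapsed" when its embedding is closer to a dental centroid than to a retroflex centroid), the cosine similarity, and all function names are assumptions made for illustration; the excerpt does not give the paper's exact definition. Toy 2-D vectors stand in for the Wav2Vec2-XLS-R frame embeddings.

```python
import math

def unit(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(frames):
    """Mean of unit-normalised frame vectors, renormalised to unit length."""
    units = [unit(f) for f in frames]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return unit(mean)

def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def collapse_rate(segments, retroflex_c, dental_c):
    """Fraction of retroflex-target segments that sit closer to the
    dental centroid than to the retroflex centroid (assumed rule)."""
    collapsed = 0
    for frames in segments:
        v = centroid(frames)  # pool the frames of one aligned phone
        if cosine(v, dental_c) > cosine(v, retroflex_c):
            collapsed += 1
    return collapsed / len(segments)

# Hypothetical native-speaker reference frames (toy 2-D embeddings).
retroflex_c = centroid([[1.0, 0.1], [1.0, -0.1]])
dental_c = centroid([[0.1, 1.0], [-0.1, 1.0]])

# Three synthesised segments whose target phone is retroflex.
segments = [
    [[0.9, 0.2], [1.0, 0.1]],  # near the retroflex centroid
    [[0.8, 0.3]],              # near the retroflex centroid
    [[0.2, 0.9], [0.1, 1.0]],  # near the dental centroid: collapsed
]

rate = collapse_rate(segments, retroflex_c, dental_c)
print(f"{rate:.3f}")  # 0.333
```

In a real pipeline the segment boundaries would come from the forced aligner and the vectors from the encoder layer named in the abstract; only the comparison against native centroids is illustrated here.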
