CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation
Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang, Haoyu Song, Ian Mcloughlin

TL;DR
CLARITY is a backbone-agnostic framework that reduces accent and linguistic biases in text-to-speech systems by localizing input text to the target dialect and retrieving accent-consistent speech prompts, improving accent accuracy and fairness.
Contribution
It introduces a dual-signal optimization approach combining contextual linguistic adaptation and retrieval-augmented accent prompting for inclusive TTS.
Findings
Improves accent accuracy across twelve English varieties.
Enhances fairness and perceptual quality of generated speech.
Demonstrates effectiveness through subjective and objective evaluations.
Abstract
Instruction-guided text-to-speech (TTS) research has matured to the point where excellent speech generation quality is possible on demand, yet two coupled biases persist and reduce perceived quality: accent bias, where models default towards dominant phonetic patterns, and linguistic bias, a misalignment in dialect-specific lexical or cultural information. These biases are interdependent: authentic accent generation requires both accent fidelity and correctly localized text. We present CLARITY (Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis), a backbone-agnostic framework that addresses both biases through dual-signal optimization. First, we apply contextual linguistic adaptation to localize the input text so that it aligns with the target dialect. Second, we propose retrieval-augmented accent prompting (RAAP) to ensure accent-consistent speech prompts. We…
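The two signals described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lookup-table localizer stands in for whatever contextual adaptation method the paper uses, and the names `SpeechPrompt`, `LOCALIZATION`, `localize`, and `retrieve_accent_prompts`, the dialect tags, and the toy embeddings are all hypothetical.

```python
from dataclasses import dataclass
import math


@dataclass
class SpeechPrompt:
    """A candidate speech prompt (hypothetical structure)."""
    prompt_id: str
    accent: str             # dialect tag, e.g. "en-GB" (assumed labeling scheme)
    embedding: list[float]  # toy vector standing in for a speaker/accent embedding


# Hypothetical lexical table; a stand-in for contextual linguistic adaptation.
LOCALIZATION = {
    "en-GB": {"apartment": "flat", "elevator": "lift"},
}


def localize(text: str, dialect: str) -> str:
    """Swap words for dialect-specific variants (whitespace tokenization only)."""
    table = LOCALIZATION.get(dialect, {})
    return " ".join(table.get(word.lower(), word) for word in text.split())


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_accent_prompts(query: list[float], accent: str,
                            pool: list[SpeechPrompt], k: int = 1) -> list[SpeechPrompt]:
    """Keep only prompts with the target accent, then rank them by similarity."""
    candidates = [p for p in pool if p.accent == accent]
    candidates.sort(key=lambda p: cosine(query, p.embedding), reverse=True)
    return candidates[:k]
```

In this sketch the localized text and the retrieved prompt would both be passed to the TTS backbone. Filtering by accent label before ranking is one simple way to keep prompts accent-consistent; the paper's actual retrieval criterion may differ.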
Taxonomy
Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Natural Language Processing Techniques
