Rethinking Discrete Speech Representation Tokens for Accent Generation
Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell

TL;DR
This paper systematically investigates how accent information is encoded in Discrete Speech Representation Tokens (DSRTs), revealing key factors affecting accent retention and the impact of supervision and codebook size on accent disentanglement.
Contribution
It introduces a unified evaluation framework for assessing accent information in DSRTs and provides insights into how layer choice, supervision, and codebook size influence accent encoding.
Findings
Layer choice significantly affects accent retention.
ASR supervision reduces accent information.
Codebook size reduction does not effectively disentangle accent.
Abstract
Discrete Speech Representation Tokens (DSRTs) have become a foundational component in speech generation. While prior work has extensively studied phonetic and speaker information in DSRTs, how accent information is encoded in DSRTs remains largely unexplored. In this paper, we present the first systematic investigation of accent information in DSRTs. We propose a unified evaluation framework that measures both the accessibility of accent information, via a novel Accent ABX task, and its recoverability, via cross-accent Voice Conversion (VC) resynthesis. Using this framework, we analyse DSRTs derived from several widely used speech representations. Our results reveal that: (1) the choice of layers has the most significant impact on retaining accent information; (2) accent information is substantially reduced by ASR supervision; (3) naive codebook size reduction cannot effectively disentangle accent…
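To make the "accessibility" measurement more concrete, the following is a minimal sketch of a generic ABX discrimination score, not the paper's Accent ABX implementation, whose details are not given in this summary. It assumes each utterance has already been reduced to a single fixed-size vector (for example by mean-pooling frame-level features) and that cosine distance is used; both are simplifying assumptions, and the names `cosine_distance` and `abx_accuracy` are hypothetical. In each triplet, X shares its accent with A and not with B, and the triplet counts as correct when X lies closer to A than to B.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two 1-D vectors (assumed non-zero)."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def abx_accuracy(triplets):
    """Fraction of (a, b, x) triplets where x is closer to a than to b.

    Assumption: x shares its accent with a and differs in accent from b.
    Ties are scored as 0.5, i.e. chance level for that triplet.
    """
    scores = []
    for a, b, x in triplets:
        d_ax = cosine_distance(a, x)
        d_bx = cosine_distance(b, x)
        if d_ax < d_bx:
            scores.append(1.0)
        elif d_ax == d_bx:
            scores.append(0.5)
        else:
            scores.append(0.0)
    return float(np.mean(scores))


# Toy, made-up vectors standing in for pooled utterance representations.
accent_1 = np.array([1.0, 0.0])
accent_2 = np.array([0.0, 1.0])
toy_triplets = [
    (accent_1, accent_2, np.array([0.9, 0.1])),  # x resembles accent_1
    (accent_2, accent_1, np.array([0.2, 0.8])),  # x resembles accent_2
]
toy_score = abx_accuracy(toy_triplets)
```

Under this kind of metric, a score near 0.5 would suggest the representation carries little accessible accent information, while a score near 1.0 would suggest accent is easy to tell apart. Frame-level ABX setups typically align sequences (for example with dynamic time warping) rather than pooling, so this sketch only illustrates the scoring logic.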
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Voice and Speech Disorders
