Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition

Huimeng Wang; Xurong Xie; Mengzhe Geng; Shujie Hu; Haoning Xu; Youjun Chen; Zhaoqing Li; Jiajun Deng; Xunying Liu

arXiv:2501.04379 · cs.SD · January 9, 2025

Open Access

TL;DR

This paper introduces phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. They strengthen phonetic discrimination and reduce word error rate (WER) by up to 1.77% absolute compared with tokens from standard K-means and VAE-VQ extraction.

Contribution

It proposes a novel phone-purity guided approach that uses phonetic label supervision to regularize discrete token extraction (K-means and VAE-VQ), improving recognition accuracy on disordered speech.
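
Phone purity, as commonly defined for HuBERT-style discrete units, is the frame-level accuracy obtained when each token is mapped to the phone it co-occurs with most often. The sketch below computes it, assuming frame-aligned token IDs and phone labels; the function name phone_purity is hypothetical, and the paper's exact definition may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def phone_purity(tokens: Sequence[int], phones: Sequence[str]) -> float:
    """Fraction of frames whose phone equals the majority phone of their token.

    Equivalent to mapping each discrete token to its most frequent phone and
    scoring that mapping frame by frame. Hypothetical helper for illustration.
    """
    if len(tokens) != len(phones):
        raise ValueError("tokens and phones must be frame-aligned")
    if not tokens:
        raise ValueError("need at least one frame")
    # Phone histogram for every token ID.
    per_token: Dict[int, Counter] = defaultdict(Counter)
    for tok, ph in zip(tokens, phones):
        per_token[tok][ph] += 1
    # Frames "explained" by each token's majority phone, summed over tokens.
    correct = sum(c.most_common(1)[0][1] for c in per_token.values())
    return correct / len(tokens)
```

For example, phone_purity([0, 0, 0, 1, 1, 2], ["a", "a", "b", "t", "t", "s"]) gives 5/6 (about 0.833): token 0 maps to "a" (2 of its 3 frames match), token 1 to "t" and token 2 to "s". Higher values mean each token corresponds more consistently to a single phone.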

Findings

01

PPG discrete tokens achieve lower WER than non-PPG tokens

02

WER reductions of up to 1.77% absolute and 4.82% relative

03

Visualization shows sharper decision boundaries between token clusters
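
Finding 02 quotes both an absolute and a relative WER reduction. The absolute reduction is the difference between two WERs in percentage points; the relative reduction divides that difference by the baseline WER. A small helper makes the distinction concrete; its name is hypothetical and the numbers used with it are illustrative, not taken from the paper.

```python
from typing import Tuple


def wer_reduction(baseline_wer: float, new_wer: float) -> Tuple[float, float]:
    """Return (absolute, relative) WER reduction.

    Both inputs are WERs in percent. The absolute reduction is in percentage
    points; the relative reduction is that drop as a percentage of the baseline.
    Negative values indicate that the new system is worse.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative
```

For example, wer_reduction(40.0, 38.0) returns (2.0, 5.0): a 2-point absolute drop from a 40% baseline is a 5% relative reduction.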

Abstract

Discrete tokens provide efficient and domain-adaptable speech features. Their application to disordered speech, which exhibits articulation imprecision and a large mismatch against normal speech, remains unexplored. To improve their phonetic discrimination, which is weakened during unsupervised K-means or vector quantization of continuous features, this paper proposes novel phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. Phonetic label supervision is used to regularize the maximum likelihood and reconstruction error costs used in standard K-means and VAE-VQ based discrete token extraction. Experiments conducted on the UASpeech corpus suggest that the proposed PPG discrete token features extracted from HuBERT consistently outperform hybrid TDNN and End-to-End (E2E) Conformer systems using non-PPG based K-means or VAE-VQ tokens across varying codebook sizes by…
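
The abstract states that phonetic labels regularize the K-means and VAE-VQ costs, but the cost function itself is not given here. As a minimal sketch, assuming frame-level phone labels aligned with the continuous features, one way to inject such supervision into K-means is to add a purity penalty to the assignment step. The function purity_guided_kmeans, the weight lam and the penalty 1 - P(phone | cluster) are hypothetical choices for illustration, not the authors' method, and the VAE-VQ variant is not shown.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def purity_guided_kmeans(
    feats: Sequence[Sequence[float]],
    phones: Sequence[str],
    k: int,
    lam: float = 1.0,
    iters: int = 20,
    seed: int = 0,
) -> List[int]:
    """HYPOTHETICAL K-means with a phone-purity penalty in the assignment step.

    cost(i, j) = ||x_i - mu_j||^2 + lam * (1 - P(phone_i | cluster j)),
    where P(phone | cluster j) is estimated from the current assignment.
    lam = 0 reduces to standard unsupervised K-means (Lloyd's algorithm).
    """
    n = len(feats)
    if n != len(phones):
        raise ValueError("feats and phones must be frame-aligned")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen frames; first assignment uses distance only.
    centroids: List[Vector] = [tuple(f) for f in rng.sample(list(feats), k)]
    assign = [min(range(k), key=lambda j: _sq_dist(f, centroids[j])) for f in feats]
    for _ in range(iters):
        # Update step: centroid = mean of its frames (left unchanged if the cluster is empty).
        for j in range(k):
            members = [feats[i] for i in range(n) if assign[i] == j]
            if members:
                centroids[j] = _mean(members)
        # Phone histogram of each cluster under the current assignment.
        counts = [Counter() for _ in range(k)]
        for a, ph in zip(assign, phones):
            counts[a][ph] += 1
        sizes = [sum(c.values()) for c in counts]

        def cost(i: int, j: int) -> float:
            purity = counts[j][phones[i]] / sizes[j] if sizes[j] else 0.0
            return _sq_dist(feats[i], centroids[j]) + lam * (1.0 - purity)

        # Assignment step: distortion plus the phone-purity penalty.
        new_assign = [min(range(k), key=lambda j: cost(i, j)) for i in range(n)]
        if new_assign == assign:
            break
        assign = new_assign
    return assign
```

Setting lam=0 gives the unsupervised baseline; increasing lam trades quantization distortion for purer clusters. Because the penalty is computed from the previous assignment's histograms, the combined cost is not guaranteed to decrease at every iteration, which is why the loop is capped by iters.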

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing