Frontend Token Enhancement for Token-Based Speech Recognition
Takanori Ashihara, Shota Horiguchi, Kohei Matsuura, Tsubasa Ochiai, Marc Delcroix

TL;DR
This paper proposes a frontend system that estimates clean speech tokens from noisy speech to improve token-based speech recognition. On CHiME-4, wave-to-token enhancement outperforms the other enhancement models considered and often surpasses ASR based on continuous SSL features.
Contribution
Introduces a frontend token enhancement system for noisy speech that improves ASR performance and compares four enhancement models by input/output domain, with wave-to-token models giving the best results.
Findings
Wave-to-token enhancement achieves the best performance among the four enhancement models.
It often outperforms ASR based on continuous SSL features on CHiME-4.
The enhancement models are trained independently of ASR backends.
Abstract
Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves…
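The sensitivity of clustered tokens to noise described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the 2-D "features", the four centroids, the Gaussian noise level, and the helper names `quantize` and `token_mismatch_rate` are all hypothetical stand-ins for SSL features, a k-means codebook, and environmental noise.

```python
import random

def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid
    (squared Euclidean distance), as in k-means token assignment."""
    tokens = []
    for frame in frames:
        dists = [sum((x - c) ** 2 for x, c in zip(frame, centroid))
                 for centroid in centroids]
        tokens.append(dists.index(min(dists)))
    return tokens

def token_mismatch_rate(ref, hyp):
    """Fraction of frames whose token differs between two sequences."""
    assert len(ref) == len(hyp)
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)

# Hypothetical toy data: a 4-entry codebook and 200 "clean" frames
# that sit exactly on centroids, plus noisy copies of those frames.
rng = random.Random(0)
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
clean_frames = [list(rng.choice(centroids)) for _ in range(200)]
noisy_frames = [[x + rng.gauss(0.0, 0.4) for x in f] for f in clean_frames]

clean_tokens = quantize(clean_frames, centroids)
noisy_tokens = quantize(noisy_frames, centroids)
rate = token_mismatch_rate(clean_tokens, noisy_tokens)
print(f"token mismatch rate under noise: {rate:.2f}")
```

In this toy setting, a frontend of the kind the abstract describes would aim to map the noisy input back to `clean_tokens` before the token-based ASR backend consumes them.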
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
