Frontend Token Enhancement for Token-Based Speech Recognition

Takanori Ashihara; Shota Horiguchi; Kohei Matsuura; Tsubasa Ochiai; Marc Delcroix

arXiv:2602.04217 · cs.SD · February 5, 2026

Open Access

TL;DR

This paper proposes a frontend system that estimates clean speech tokens from noisy speech to improve token-based speech recognition. On CHiME-4, wave-to-token enhancement outperforms the other enhancement types considered (wave-to-wave, token-to-token, and continuous SSL features-to-token) and often surpasses ASR based on continuous SSL features.

Contribution

Introduces a frontend token enhancement system that estimates clean speech tokens from noisy speech and is trained independently of the ASR backend. Of the four enhancement model types compared, wave-to-token gives the best ASR results.

Findings

01 Wave-to-token enhancement achieves the best performance among the four enhancement model types.

02 Wave-to-token enhancement often surpasses ASR based on continuous SSL features on CHiME-4.

03 The enhancement models are trained independently of ASR backends.

Abstract

Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves…
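The abstract's point that clustering-derived tokens are sensitive to noise can be illustrated with a toy sketch. It assumes nearest-centroid quantization of feature frames, which is a common way to obtain such tokens but is not confirmed here as the authors' exact procedure. The 2-D features, the 4-entry codebook, the noise levels, and the helper names `quantize` and `token_mismatch_rate` are all hypothetical and chosen only for illustration. The sketch shows the degradation a frontend would need to undo; it does not implement any of the four enhancement models.

```python
import random


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid
    (squared Euclidean distance). Hypothetical helper for illustration."""
    tokens = []
    for frame in frames:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        tokens.append(dists.index(min(dists)))
    return tokens


def token_mismatch_rate(ref, hyp):
    """Fraction of frame positions whose tokens differ (frame-aligned)."""
    assert len(ref) == len(hyp)
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)


rng = random.Random(0)

# Assumption: a toy codebook standing in for cluster centroids of SSL features.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# "Clean" frames lie close to a randomly chosen centroid.
clean = [
    tuple(c + rng.gauss(0, 0.05) for c in centroids[rng.randrange(4)])
    for _ in range(200)
]

# "Noisy" frames: the same frames with a larger additive perturbation,
# a crude stand-in for the effect of environmental noise on the features.
noisy = [tuple(x + rng.gauss(0, 0.4) for x in frame) for frame in clean]

clean_tokens = quantize(clean, centroids)
noisy_tokens = quantize(noisy, centroids)
mismatch = token_mismatch_rate(clean_tokens, noisy_tokens)
print(f"token mismatch rate under noise: {mismatch:.2f}")
```

In this framing, a token enhancement frontend would take the noisy input (waveform, features, or tokens, depending on the model type) and aim to output something close to `clean_tokens`.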

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research