Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

Shreyas Gopal; Ashutosh Anshul; Haoyang Li; Yue Heng Yeo; Hexin Liu; Eng Siong Chng

arXiv:2510.25150·cs.CL·October 30, 2025

TL;DR

This paper disentangles speech content from background noise in discrete speech representations: clean speech is encoded as codebook tokens and noise is captured as residual vectors, which improves noise robustness and ASR accuracy.

Contribution

The paper proposes an end-to-end model that separates semantic speech content from noise in the latent space, improving noise invariance and ASR performance while keeping the Whisper model frozen.

Findings

01. 82% reduction in error rate compared to Whisper

02. 35% improvement over baseline methods

03. Generalizes well to unseen acoustic conditions
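
The first two findings are percentage reductions in error rate. A minimal sketch of how such a figure is usually computed follows, assuming the numbers are relative reductions in word error rate (the page does not say so explicitly). The function name and the example error rates are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction by which system_wer improves on baseline_wer.

    Both arguments are error rates on the same scale (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical error rates chosen only to illustrate the arithmetic;
# they are NOT results reported in the paper.
hypothetical_baseline = 50.0
hypothetical_system = 9.0
reduction = relative_error_reduction(hypothetical_baseline, hypothetical_system)
print(f"relative reduction: {reduction:.0%}")
```

Under this reading, a large relative reduction says nothing about the absolute error rates, so the baseline value matters when comparing systems.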

Abstract

Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing work that quantizes Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue, which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods…
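
To make the phrase "quantization residue" concrete, here is a minimal NumPy sketch of nearest-neighbour vector quantization. It is an illustration under assumptions, not the authors' model: the abstract describes a learned end-to-end system on frozen Whisper embeddings, whereas this sketch uses a random codebook, a hypothetical function name `quantize_with_residue`, and synthetic "embeddings". The lightweight noise classifier is not shown.

```python
import numpy as np


def quantize_with_residue(embeddings: np.ndarray, codebook: np.ndarray):
    """Assign each frame to its nearest codebook entry.

    embeddings: (T, D) array of frame-level vectors.
    codebook:   (K, D) array of code vectors.
    Returns (tokens, quantized, residue) where
      tokens    is (T,) indices into the codebook,
      quantized is (T, D) selected code vectors,
      residue   is (T, D) what quantization leaves behind.
    """
    # Squared Euclidean distance from every frame to every code vector.
    diffs = embeddings[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)
    quantized = codebook[tokens]
    residue = embeddings - quantized
    return tokens, quantized, residue


# Synthetic stand-ins (assumptions): a random codebook and "clean" frames
# that sit exactly on codebook entries, plus a slightly perturbed copy.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
clean = codebook[[1, 5, 2]]
noisy = clean + 0.01 * rng.normal(size=clean.shape)

tokens_clean, _, residue_clean = quantize_with_residue(clean, codebook)
tokens_noisy, quantized_noisy, residue_noisy = quantize_with_residue(noisy, codebook)
```

In this toy setting the token sequence carries the discrete content, while the residue holds whatever the codebook could not represent. The abstract's idea is to train the model so that this residue captures noise and the tokens stay stable across clean and noisy inputs.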
