Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Mingyu Cui; Yifan Yang; Jiajun Deng; Jiawen Kang; Shujie Hu; Tianzi Wang; Zhaoqing Li; Shiliang Zhang; Xie Chen; Xunying Liu

arXiv:2409.08797·cs.CL·June 11, 2025

TL;DR

This paper shows that using SSL discrete speech features as cross-utterance context in Zipformer-Transducer ASR systems lowers word error rate (WER) on the Gigaspeech corpus, reaching the lowest published WERs of 11.15% (dev) and 11.14% (test).

Contribution

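The paper introduces SSL discrete speech features, extracted from WavLM models, as additional cross-utterance acoustic context in Zipformer-Transducer ASR, and reports statistically significant WER reductions over a baseline that uses utterance-internal context only.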
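To make the idea of discrete tokens as cross-utterance context concrete, here is a minimal sketch. It assumes tokens are obtained by nearest-centroid (k-means style) assignment of SSL frame vectors, and that context is formed by simply concatenating the token sequences of the preceding, current, and future segments. Both are illustrative assumptions: this summary does not say how the paper derives its tokens or fuses the context. The names `quantize` and `with_context`, the centroids, and the frame values are all hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize(frames: Sequence[Sequence[float]],
             centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame vector to the index of its nearest centroid.

    Assumption: nearest-centroid assignment under squared Euclidean distance.
    """
    tokens = []
    for frame in frames:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, centroid))
                 for centroid in centroids]
        tokens.append(dists.index(min(dists)))
    return tokens


def with_context(prev: Sequence[int], cur: Sequence[int],
                 nxt: Sequence[int]) -> Tuple[List[int], Tuple[int, int]]:
    """Concatenate preceding, current, and future token sequences.

    Returns the joined tokens and the (start, end) span of the current
    utterance, so a downstream model can tell context from target frames.
    """
    tokens = list(prev) + list(cur) + list(nxt)
    start = len(prev)
    return tokens, (start, start + len(cur))


# Toy data: three 2-D centroids and four "SSL" frames (made-up values).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.8, 4.1], [1.1, 0.8]]

cur_tokens = quantize(frames, centroids)            # [0, 1, 2, 1]
context_tokens, cur_span = with_context([2, 2], cur_tokens, [0])
print(context_tokens, cur_span)                     # [2, 2, 0, 1, 2, 1, 0] (2, 6)
```

Each frame collapses to a single integer, which is what makes the representation compact compared with a multi-dimensional Fbank vector per frame.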

Findings

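1. Discrete token features outperform Fbank features for context modeling.

2. Statistically significant WER reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) over a baseline that uses utterance-internal context only.

3. Lowest published WERs of 11.15% and 11.14% on the Gigaspeech dev and test sets, respectively.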

Abstract

Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling either cross-utterance contexts (from preceding and future segments), or the current utterance's internal contexts alone, or both at the same time, is demonstrated thoroughly on the Gigaspeech 1000-hr corpus. The best Zipformer-Transducer system using discrete-token-based cross-utterance context features outperforms the baseline using utterance-internal context only, with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published…
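The absolute and relative reductions quoted above are linked by relative = absolute / baseline WER, so the baseline WERs can be back-computed as a sanity check. The sketch below assumes the lower ends of the two ranges belong to one data set and the upper ends to the other; the abstract does not state that pairing, and the function name `implied_baseline` is hypothetical.

```python
def implied_baseline(absolute_reduction: float,
                     relative_reduction_pct: float) -> float:
    """Baseline WER (in %) implied by an absolute and a relative reduction."""
    return absolute_reduction / (relative_reduction_pct / 100.0)


# (absolute reduction in WER points, relative reduction in percent)
# Assumption: range endpoints are paired low-with-low and high-with-high.
pairs = [(0.32, 2.78), (0.41, 3.54)]
baselines = [round(implied_baseline(a, r), 2) for a, r in pairs]
print(baselines)  # [11.51, 11.58]
```

Under that pairing, the implied baseline WERs are roughly 11.5% to 11.6%. These are derived estimates, not figures reported in this summary.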

Code & Models

Repositories

open-creator/icefall (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis