Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding
Yu Xi, Xiaoyu Gu, Haoyu Li, Jun Song, Bo Zheng, Kai Yu

TL;DR
This paper introduces a novel training and decoding strategy for RNN-T-based keyword spotting that reduces overfitting and combines the strengths of autoregressive and non-autoregressive methods, leading to improved performance.
Contribution
It proposes a masked self-distillation training method and a semi-autoregressive decoding approach to enhance RNN-T keyword spotting models.
Findings
MSD training alleviates overfitting in RNN-T KWS models.
SAR decoding combines AR and NAR advantages, improving accuracy.
Experimental results show state-of-the-art performance across multiple datasets.
Abstract
RNN-T-based keyword spotting (KWS) with autoregressive (AR) decoding has gained attention due to its streaming architecture and superior performance. However, the simplicity of the prediction network in RNN-T causes overfitting, especially in challenging scenarios, which degrades performance. In this paper, we propose a masked self-distillation (MSD) training strategy that keeps RNN-T models from relying too heavily on the prediction network, thereby alleviating overfitting. Such training enables masked non-autoregressive (NAR) decoding, which fully masks the RNN-T predictor output during KWS decoding. In addition, we propose a semi-autoregressive (SAR) decoding approach that integrates the advantages of AR and NAR decoding. Our experiments across multiple KWS datasets demonstrate that MSD training effectively alleviates overfitting. The SAR decoding method preserves the superior performance of AR…
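To make the idea of masking the predictor output concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the additive joint network, the choice to mask by replacing the predictor state with zeros, the toy weights, and the KL term between the unmasked (AR-style) and masked (NAR-style) posteriors are all hypothetical. The abstract does not specify the joint network, the masking operation, or the distillation loss.

```python
import math


def linear(weights, vec):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def joint_logits(enc_frame, pred_state, w_enc, w_pred, bias):
    """Additive RNN-T-style joint (assumed form): W_enc*enc + W_pred*pred + b."""
    enc_part = linear(w_enc, enc_frame)
    pred_part = linear(w_pred, pred_state)
    return [e + p + b for e, p, b in zip(enc_part, pred_part, bias)]


def mask_predictor(pred_state):
    """Fully mask the predictor output (assumption: replace it with zeros)."""
    return [0.0] * len(pred_state)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy sizes (hypothetical): 2-dim encoder frame, 2-dim predictor state, 3 tokens.
W_ENC = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
W_PRED = [[0.2, 0.1], [-0.3, 0.4], [0.0, -0.5]]
BIAS = [0.0, 0.1, -0.1]

enc_frame = [1.0, 2.0]    # one acoustic encoder frame
pred_state = [0.5, -1.0]  # predictor output conditioned on previous tokens

# AR-style posterior: the joint sees both the encoder and the predictor.
ar_post = softmax(joint_logits(enc_frame, pred_state, W_ENC, W_PRED, BIAS))

# Masked NAR-style posterior: the predictor output is hidden from the joint.
masked_state = mask_predictor(pred_state)
nar_post = softmax(joint_logits(enc_frame, masked_state, W_ENC, W_PRED, BIAS))

# Assumed distillation signal: pull the masked branch toward the unmasked one.
distill_loss = kl_divergence(ar_post, nar_post)
print("AR posterior: ", ar_post)
print("NAR posterior:", nar_post)
print("KL(AR || NAR):", distill_loss)
```

With the predictor masked, the logits depend only on the encoder frame and the bias, so the masked branch cannot lean on the label history. How the AR and masked branches are combined at decoding time in the SAR approach is not described in the abstract, so it is left out of this sketch.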
Taxonomy
Topics: Advanced Text Analysis Techniques
