Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding

Yu Xi; Xiaoyu Gu; Haoyu Li; Jun Song; Bo Zheng; Kai Yu

arXiv:2505.24820 · cs.SD · June 2, 2025

TL;DR

This paper introduces a novel training and decoding strategy for RNN-T-based keyword spotting that reduces overfitting and combines the strengths of autoregressive and non-autoregressive methods, leading to improved performance.

Contribution

It proposes a masked self-distillation training method and a semi-autoregressive decoding approach to enhance RNN-T keyword spotting models.

Findings

1. Masked self-distillation (MSD) training alleviates overfitting in RNN-T KWS models.
2. Semi-autoregressive (SAR) decoding combines the advantages of autoregressive (AR) and non-autoregressive (NAR) decoding, improving accuracy.
3. Experimental results show state-of-the-art performance across multiple datasets.

Abstract

RNN-T-based keyword spotting (KWS) with autoregressive (AR) decoding has gained attention due to its streaming architecture and superior performance. However, the simplicity of the RNN-T prediction network causes overfitting, especially in challenging scenarios, which degrades performance. In this paper, we propose a masked self-distillation (MSD) training strategy that prevents the RNN-T from relying excessively on its prediction network, thereby alleviating overfitting. Such training enables masked non-autoregressive (NAR) decoding, which fully masks the RNN-T predictor output during KWS decoding. In addition, we propose a semi-autoregressive (SAR) decoding approach that integrates the advantages of AR and NAR decoding. Our experiments across multiple KWS datasets demonstrate that MSD training effectively alleviates overfitting. The SAR decoding method preserves the superior performance of AR…
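The abstract does not specify the masking scheme, the distillation loss, or the form of the joint network, so the following is only a hypothetical NumPy sketch of the general idea, not the authors' method. It assumes a toy additive joint network, random masking of the predictor output at each label position with probability `mask_prob`, and a KL term in which the unmasked branch acts as teacher for the masked branch. All names (`joint`, `mask_predictor`, `self_distill_kl`) are illustrative. Setting `mask_prob=1.0` gives a fully masked predictor, as in the masked NAR decoding described above.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def joint(enc, pred, w_out):
    """Toy additive RNN-T joint network (hypothetical form).

    enc:   (T, H) encoder frames
    pred:  (U, H) predictor (prediction-network) outputs
    w_out: (H, V) output projection
    Returns log-probabilities of shape (T, U, V).
    """
    hidden = np.tanh(enc[:, None, :] + pred[None, :, :])  # (T, U, H)
    return log_softmax(hidden @ w_out)                     # (T, U, V)


def mask_predictor(pred, mask_prob, rng):
    # Assumed scheme: zero out each label position's predictor output
    # with probability mask_prob (mask_prob=1.0 masks everything).
    keep = rng.random(pred.shape[0]) >= mask_prob  # (U,) boolean
    return pred * keep[:, None]


def self_distill_kl(enc, pred, w_out, mask_prob, rng):
    """Mean KL(teacher || student) over all (t, u) lattice nodes.

    Assumption: teacher = unmasked branch, student = masked branch.
    The actual MSD loss in the paper may be defined differently.
    """
    teacher = joint(enc, pred, w_out)
    student = joint(enc, mask_predictor(pred, mask_prob, rng), w_out)
    p_teacher = np.exp(teacher)
    return float((p_teacher * (teacher - student)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
enc = rng.standard_normal((5, 8))   # T=5 frames, H=8
pred = rng.standard_normal((3, 8))  # U=3 label positions
w_out = rng.standard_normal((8, 4))  # V=4 output tokens

# Fully masked predictor: every label position sees identical input,
# so the output depends only on the encoder frame (NAR-style posteriors).
nar = joint(enc, mask_predictor(pred, 1.0, rng), w_out)
print(np.allclose(nar[:, 0], nar[:, 2]))
print(self_distill_kl(enc, pred, w_out, 0.5, rng))
```

Because a fully masked predictor feeds the same (zero) vector to every label position, the (T, U, V) output collapses to frame-wise posteriors, so decoding no longer needs to feed previously emitted labels back into the predictor. With `mask_prob=0.0` the student equals the teacher and the KL term is zero.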

Taxonomy

Topics: Advanced Text Analysis Techniques