CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer
Zhanheng Yang, Sining Sun, Jin Li, Xiaoming Zhang, Xiong Wang, Long Ma, Lei Xie

TL;DR
This paper introduces CaTT-KWS, a multi-stage keyword spotting framework combining transducer and transformer models to significantly reduce false alarms while maintaining high recognition accuracy for edge device deployment.
Contribution
The paper proposes a multi-stage KWS framework that cascades a transducer-based keyword detector, forced alignment driven by a frame-level phone predictor, and a transformer-based decoder, with the aim of cutting false alarms relative to transducer-only detection.
Findings
Achieves 0.13 false alarms per hour on a challenging dataset.
Reduces false alarms by over 90% relative to transducer-only detection.
Maintains keyword recognition accuracy with less than 2% drop.
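The two false-alarm figures above rest on simple ratios. The sketch below shows how false alarms per hour and relative reduction are conventionally computed; the function names and the sample numbers are illustrative assumptions and are not taken from the paper or its dataset.

```python
def false_alarms_per_hour(num_false_alarms: int, audio_seconds: float) -> float:
    """False alarms divided by the duration of the evaluated audio, in hours."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return num_false_alarms / (audio_seconds / 3600.0)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Illustrative numbers only (NOT from the paper):
# 5 false alarms over 20 hours of audio.
fa_rate = false_alarms_per_hour(5, 20 * 3600)

# A hypothetical drop from 2.0 to 0.2 false alarms per hour.
reduction = relative_reduction(2.0, 0.2)

print(f"{fa_rate:.2f} FA/hour, {reduction:.0%} relative reduction")
```

A relative reduction of "over 90%" therefore means the final false-alarm rate is less than one tenth of the transducer-only rate.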
Abstract
Customized keyword spotting (KWS) has great potential to be deployed on edge devices to achieve hands-free user experience. However, in real applications, false alarm (FA) would be a serious problem for spotting dozens or even hundreds of keywords, which drastically affects user experience. To solve this problem, in this paper, we leverage the recent advances in transducer and transformer based acoustic models and propose a new multi-stage customized KWS framework named Cascaded Transducer-Transformer KWS (CaTT-KWS), which includes a transducer based keyword detector, a frame-level phone predictor based force alignment module and a transformer based decoder. Specifically, the streaming transducer module is used to spot keyword candidates in audio stream. Then force alignment is implemented using the phone posteriors predicted by the phone predictor to finish the first stage keyword…
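The cascade described in the abstract can be pictured as a chain of increasingly selective checks, where a keyword candidate is rejected as soon as any stage is not convinced. The following is a minimal sketch under stated assumptions: that each stage produces a single confidence score compared against a threshold, and that stages run in the order listed. The `Stage` class, `run_cascade`, the constant scorers, and all thresholds are hypothetical stand-ins; the truncated abstract does not specify how the stage decisions are actually combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Stage:
    """One verification stage (hypothetical interface, not the paper's API)."""
    name: str
    score: Callable[[Sequence[float], str], float]  # (audio, keyword) -> confidence
    threshold: float


def run_cascade(
    stages: List[Stage], audio: Sequence[float], keyword: str
) -> Tuple[bool, List[str]]:
    """Accept the keyword only if every stage scores at or above its threshold.

    Returns (accepted, names of the stages that passed before any rejection).
    """
    passed: List[str] = []
    for stage in stages:
        if stage.score(audio, keyword) < stage.threshold:
            return False, passed  # early exit: later stages are never run
        passed.append(stage.name)
    return True, passed


# Placeholder scorers returning fixed values; real stages would run models.
stages = [
    Stage("transducer_detector", lambda audio, kw: 0.9, threshold=0.5),
    Stage("forced_alignment", lambda audio, kw: 0.4, threshold=0.6),
    Stage("transformer_decoder", lambda audio, kw: 0.8, threshold=0.7),
]

accepted, passed = run_cascade(stages, audio=[0.0] * 16000, keyword="hello")
print(accepted, passed)  # False ['transducer_detector']
```

The early exit reflects a common motivation for cascades on edge devices: the cheapest check runs on everything, and costlier checks run only on the candidates that survive it.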
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
