Very Fast Keyword Spotting System with Real Time Factor below 0.01
Jan Nouza, Petr Cerva, Jindrich Zdansky

TL;DR
This paper introduces a highly optimized neural-network-based keyword spotting (KWS) system that runs with a real-time factor below 0.01 and performs well on various types of speech data.
Contribution
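The paper presents an efficient keyword spotting architecture that combines bidirectional feedforward sequential memory networks (an alternative to recurrent nets) with entirely forward decoding, and proposes speed optimizations for every step of the pipeline: signal processing and likelihood computation, Viterbi decoding, spot candidate detection and confidence calculation.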
Findings
RT factor close to 0.001 with all optimizations
Evaluated on three large Czech datasets (broadcast, internet and telephone; 17 hours in total)
Outperforms previous systems in speed and efficiency
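To make the real-time (RT) factor figures concrete, the following minimal sketch assumes the usual definition, processing time divided by audio duration, and applies it to the 17 hours of evaluation data. The function names are hypothetical and chosen only for illustration.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time_at(rt_factor: float, audio_seconds: float) -> float:
    """Wall-clock seconds needed to process the audio at a given RT factor."""
    return rt_factor * audio_seconds


audio_seconds = 17 * 3600  # the 17 hours of evaluation data, in seconds

# At an RT factor of 0.01 the whole set takes roughly ten minutes;
# at 0.001 it takes roughly one minute.
for rtf in (0.01, 0.001):
    seconds = processing_time_at(rtf, audio_seconds)
    print(f"RT factor {rtf}: about {seconds / 60:.1f} minutes")
```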
Abstract
In this paper we present an architecture of a keyword spotting (KWS) system that is based on modern neural networks, yields good performance on various types of speech data and can run very fast. We focus mainly on the last aspect and propose optimizations for all the steps required in a KWS design: signal processing and likelihood computation, Viterbi decoding, spot candidate detection and confidence calculation. We present time- and memory-efficient modelling by bidirectional feedforward sequential memory networks (an alternative to recurrent nets) either by standard triphones or so-called quasi-monophones, and an entirely forward decoding of speech frames (with minimal need for look-back). Several variants of the proposed scheme are evaluated on 3 large Czech datasets (broadcast, internet and telephone, 17 hours in total) and their performance is compared by Detection Error Tradeoff…
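As a rough illustration of why a feedforward sequential memory network can stand in for a recurrent net, the sketch below implements a generic bidirectional memory block with scalar tap weights: each output frame is a weighted sum of the current frame, a few past frames and a few future frames, so no recurrence is needed. This is an assumed textbook-style formulation, not the authors' configuration; the function name, tap values and zero padding at utterance edges are all illustrative choices.

```python
from typing import List, Sequence

Frame = List[float]


def bidirectional_memory(hidden: Sequence[Frame],
                         look_back: Sequence[float],
                         look_ahead: Sequence[float]) -> List[Frame]:
    """Scalar-tap FSMN-style memory block (illustrative sketch).

    out[t] = sum_i look_back[i] * hidden[t - i]        for i = 0 .. N1
           + sum_j look_ahead[j - 1] * hidden[t + j]   for j = 1 .. N2
    Frames outside the utterance are treated as zeros (an assumption).
    """
    n_frames = len(hidden)
    dim = len(hidden[0]) if n_frames else 0
    out: List[Frame] = []
    for t in range(n_frames):
        acc = [0.0] * dim
        # Current and past frames: tap i weights the frame i steps back.
        for i, a in enumerate(look_back):
            if t - i >= 0:
                for d in range(dim):
                    acc[d] += a * hidden[t - i][d]
        # Future frames: tap j weights the frame j steps ahead.
        for j, c in enumerate(look_ahead, start=1):
            if t + j < n_frames:
                for d in range(dim):
                    acc[d] += c * hidden[t + j][d]
        out.append(acc)
    return out


# Three one-dimensional frames, two look-back taps, one look-ahead tap.
frames = [[1.0], [2.0], [3.0]]
memory = bidirectional_memory(frames, look_back=[1.0, 0.5], look_ahead=[0.25])
print(memory)
```

Because every output frame depends only on a fixed window of input frames, all frames can be computed independently, which is one reason such layers are attractive when speed matters.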
