Bifocal Neural ASR: Exploiting Keyword Spotting for Inference Optimization
Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow

TL;DR
This paper introduces Bifocal RNN-T, an RNN-T variant that uses keyword spotting to choose which part of the network runs on each audio frame, reducing inference cost by 29.1% while keeping word error rates comparable.
Contribution
The paper proposes Bifocal RNN-T, built on a recurrent cell called the Bifocal LSTM (BFLSTM) that switches its compute pathway per audio frame based on keyword spotting, improving inference efficiency.
Findings
Achieves 29.1% reduction in inference cost
Maintains comparable word error rates
Compatible with quantization and sparsification techniques
Abstract
We present Bifocal RNN-T, a new variant of the Recurrent Neural Network Transducer (RNN-T) architecture designed for improved inference time latency on speech recognition tasks. The architecture enables a dynamic pivot for its runtime compute pathway, namely taking advantage of keyword spotting to select which component of the network to execute for a given audio frame. To accomplish this, we leverage a recurrent cell we call the Bifocal LSTM (BFLSTM), which we detail in the paper. The architecture is compatible with other optimization strategies such as quantization, sparsification, and applying time-reduction layers, making it especially applicable for deployed, real-time speech recognition settings. We present the architecture and report comparative experimental results on voice-assistant speech recognition tasks. Specifically, we show our proposed Bifocal RNN-T can improve inference…
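The per-frame pathway selection described in the abstract can be illustrated with a minimal sketch. This is not the paper's BFLSTM, whose details are not given here. Everything below is an assumption made for illustration: the class names `LSTMCell` and `TwoBranchCell`, the random placeholder weights, the `keyword_end_frame` index standing in for a keyword spotter's output, and the state hand-off in which the small branch updates only the first `small_size` state units.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Standard LSTM cell in pure Python; weights are random placeholders."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n_in = input_size + hidden_size
        # Rows are stacked by gate: input, forget, candidate, output.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(4 * hidden_size)]
        self.b = [0.0] * (4 * hidden_size)
        # Multiply-accumulates per frame, used as a rough cost proxy.
        self.macs = 4 * hidden_size * n_in

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        pre = [sum(wi * zi for wi, zi in zip(row, z)) + bi
               for row, bi in zip(self.w, self.b)]
        n = self.hidden_size
        i = [sigmoid(v) for v in pre[0:n]]
        f = [sigmoid(v) for v in pre[n:2 * n]]
        g = [math.tanh(v) for v in pre[2 * n:3 * n]]
        o = [sigmoid(v) for v in pre[3 * n:4 * n]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


class TwoBranchCell:
    """Hypothetical two-branch cell: one branch runs per frame.

    ASSUMPTION: the small branch reads and updates only the first
    `small_size` units of the shared state. This hand-off is a
    placeholder, not the mechanism from the paper.
    """

    def __init__(self, input_size, hidden_size, small_size, seed=0):
        rng = random.Random(seed)
        self.small_size = small_size
        self.small = LSTMCell(input_size, small_size, rng)
        self.large = LSTMCell(input_size, hidden_size, rng)

    def step(self, x, h, c, keyword_seen):
        if keyword_seen:
            h_new, c_new = self.large.step(x, h, c)
            return h_new, c_new, self.large.macs
        k = self.small_size
        h_k, c_k = self.small.step(x, h[:k], c[:k])
        return h_k + h[k:], c_k + c[k:], self.small.macs


def run(frames, keyword_end_frame, cell, hidden_size):
    """Run all frames; the cheap branch is used until the keyword has ended."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    total_macs = 0
    for t, x in enumerate(frames):
        h, c, macs = cell.step(x, h, c, keyword_seen=(t >= keyword_end_frame))
        total_macs += macs
    return h, total_macs


if __name__ == "__main__":
    cell = TwoBranchCell(input_size=3, hidden_size=8, small_size=2)
    frames = [[0.1, 0.2, 0.3] for _ in range(10)]
    _, cost = run(frames, keyword_end_frame=4, cell=cell, hidden_size=8)
    print("bifocal MACs:", cost, "always-large MACs:", 10 * cell.large.macs)
```

In this toy setup the saving depends entirely on how many frames fall before `keyword_end_frame` and on the size ratio of the two branches, so the numbers it prints say nothing about the 29.1% figure reported for the actual system.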
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
