Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm
Weiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Pedro Moreno Mengibar

TL;DR
This paper introduces a novel algorithm for contextual biasing in speech recognition using the KMP pattern matching algorithm, improving accuracy without extra model parameters.
Contribution
It proposes a KMP-based pattern-matching approach to contextual biasing that improves speech recognition accuracy and is designed to run efficiently on TPUs.
Findings
Significant WER reduction on biasing test sets
Compatible with model-based biasing methods
Efficient memory and computation performance
Abstract
Contextual biasing refers to the problem of biasing the automatic speech recognition (ASR) systems towards rare entities that are relevant to the specific user or application scenarios. We propose algorithms for contextual biasing based on the Knuth-Morris-Pratt algorithm for pattern matching. During beam search, we boost the score of a token extension if it extends matching into a set of biasing phrases. Our method simulates the classical approaches often implemented in the weighted finite state transducer (WFST) framework, but avoids the FST language altogether, with careful considerations on memory footprint and efficiency on tensor processing units (TPUs) by vectorization. Without introducing additional model parameters, our method achieves significant word error rate (WER) reductions on biasing test sets by itself, and yields further performance gain when combined with a…
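The matching step described in the abstract can be illustrated with a minimal sketch. It assumes a single biasing phrase given as a list of token IDs, a single hypothesis, and a fixed per-token bonus `delta`; the function names and the choice to reset the match state to zero after a full match are illustrative assumptions, not the authors' implementation, and the sketch does not attempt the TPU vectorization the paper discusses. The bonus at each step is `delta` times the change in match length, so a partial match that fails has its bonus revoked when the matcher falls back.

```python
from typing import List, Sequence, Tuple


def failure_function(pattern: Sequence[int]) -> List[int]:
    """Classic KMP table: fail[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def advance(pattern: Sequence[int], fail: Sequence[int],
            state: int, token: int, delta: float) -> Tuple[int, float]:
    """Extend a hypothesis by one token.

    `state` is the number of pattern tokens currently matched
    (0 <= state < len(pattern)). Returns the new state and the score
    bonus for this extension (negative when a partial match is lost).
    """
    k = state
    # Fall back through shorter prefixes until the token fits or k == 0.
    while k > 0 and token != pattern[k]:
        k = fail[k - 1]
    if token == pattern[k]:
        k += 1
    bonus = delta * (k - state)
    if k == len(pattern):
        # Assumption: on a full match the earned bonus is kept and
        # matching restarts from the beginning of the phrase.
        k = 0
    return k, bonus


def score_tokens(pattern: Sequence[int], tokens: Sequence[int],
                 delta: float) -> Tuple[List[float], int]:
    """Run the matcher over a token sequence; return per-token bonuses
    and the final match state."""
    fail = failure_function(pattern)
    state = 0
    bonuses = []
    for tok in tokens:
        state, bonus = advance(pattern, fail, state, tok, delta)
        bonuses.append(bonus)
    return bonuses, state
```

For example, with the hypothetical phrase `[1, 2, 1, 3]` and the token sequence `[1, 2, 1, 2, 1, 3]`, the fourth token breaks a three-token partial match, the matcher falls back to a two-token match and revokes one unit of bonus, and the sequence ends with a full match whose total bonus is `4 * delta`. In a beam search, each hypothesis would carry its own `state`, and the bonus would be added to the score of the candidate token extension.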
Peer Reviews
Decision·Submitted to ICLR 2024
The paper clearly demonstrates the applicability of KMP's efficiency on the biasing task. The paper reads well and is easy to follow.
1. The paper lacks adequate references to related work in the area of contextual biasing for ASR models. I have added some relevant citations. 2. Lack of comparison on publicly available data and models, which limits reproducibility. Le et al. 2021a [2] provides an open protocol for evaluating on the LibriSpeech corpus (https://github.com/facebookresearch/fbai-speech/tree/main/is21_deep_bias). 3. Lack of comparison with baselines, how well does this model compare against a simple shallow fusio
They show how to apply the KMP algorithm inside beam search to boost the biasing phrases. The experiments show that the proposed KMP-based algorithm gives nice improvements when used together with NAM in the biasing setting.
The topic is very ASR specific. I'm not sure if the broader ICLR community is interested in this; perhaps a conference like Interspeech or ICASSP would be a better fit? In principle, the method could be applied to other tasks, for example machine translation. However, this is not investigated here. I think this would make it a better fit for ICLR. It is explained that the proposed method is conceptually similar to (or the same as?) WFST-based approaches. However, in the experiments, it is not
1. Contextual biasing in ASR is an important application. As mentioned by the authors (Section 3), biasing can be done at the decoding level or at the model level — the proposed KMP algorithm operates at the former, showing good WER improvements. It is also shown to be complementary to model-based biasing (using NAM). 2. The authors have discussed the parallelized time and space complexities of the proposed methods wherever applicable. 3. On the Contact-Tag data, the WER is improved from 14.7%
Motivations misaligned with application and results
The main objective of the paper is to build a contextual biasing system that is efficient to decode on large-scale parallelizable infrastructure such as TPUs. However, in the introduction and the experiments, the method is applied to recognizing contact names for voice assistants. In my understanding, such voice assistants are commonly placed on edge devices, which do not usually have built-in TPUs. As such, it is hard to
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
