Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval

Nikolaos Flemotomos; Roger Hsiao; Pawel Swietojanski; Takaaki Hori; Dogan Can; Xiaodan Zhuang

arXiv:2411.00664·eess.AS·November 5, 2024

Open Access

TL;DR

This paper introduces a vector quantization-based approximation for cross-attention in neural speech recognition, enabling efficient use of large biasing catalogues and significantly improving accuracy and computational efficiency.

Contribution

It proposes a novel approximation method for cross-attention using vector quantization, allowing large-scale biasing catalogues to be used efficiently in speech recognition.

Findings

1. Up to 71% relative error rate reduction in personal entity recognition.
2. 20% reduction in compute time for large biasing lists.
3. 85-95% reduction in memory usage with the proposed method.

Abstract

Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval-based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use the retrieved entries for biasing. Since the proposed approach is agnostic to the…
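
The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the k-means codebook, the dot-product scoring, the max-over-frames pooling, and all names (`kmeans`, `quantized_shortlist`, `top_n`) are assumptions made purely for illustration. Each biasing-entry embedding is replaced by the index of its nearest codebook vector, so audio frames are scored against K codebook vectors rather than all N entries, and per-entry scores are recovered by table lookup.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain k-means (an assumed way to build the codebook).

    Returns the codebook (k, d) and the code index of each row of x (N,).
    """
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        codes = dist.argmin(axis=1)
        for j in range(k):
            members = x[codes == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                codebook[j] = members.mean(axis=0)
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dist.argmin(axis=1)


def quantized_shortlist(queries, codebook, codes, top_n):
    """Shortlist biasing entries using quantized scores.

    queries:  (T, d) audio-frame embeddings
    codebook: (K, d) codebook vectors
    codes:    (N,)   codebook index assigned to each biasing entry
    Scoring costs O(T*K) instead of O(T*N); the lookup is O(N).
    """
    code_scores = queries @ codebook.T   # (T, K) frame-vs-codeword scores
    pooled = code_scores.max(axis=0)     # (K,) assumed pooling: max over frames
    entry_scores = pooled[codes]         # (N,) approximate per-entry scores
    top = np.argsort(-entry_scores)[:top_n]
    return top, entry_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(1000, 16))            # toy biasing-entry embeddings
    queries = keys[:3] + 0.1 * rng.normal(size=(3, 16))  # toy audio frames

    codebook, codes = kmeans(keys, k=32)
    shortlist, _ = quantized_shortlist(queries, codebook, codes, top_n=50)

    # Second stage: exact softmax cross-attention over the shortlist only.
    logits = queries @ keys[shortlist].T
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    context = weights @ keys[shortlist]           # (T, d) biasing context
    print(shortlist.shape, context.shape)
```

In this sketch only the codebook and one integer code per entry are needed at retrieval time, which is where the memory saving would come from; the full embeddings are touched only for the shortlisted entries.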

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Concatenated Skip Connection · Softmax