Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models

Bolaji Yusuf; Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran

arXiv:2407.04641 · eess.AS · July 8, 2024

Open Access

TL;DR

This paper introduces a novel speculative speech recognition method that combines an RNN-Transducer ASR system with an audio-prefixed language model to enable the recognizer to anticipate speech, reducing latency.

Contribution

It proposes a new model for speculative speech recognition using audio-prefixed language models and introduces a metric for evaluating SSR performance.

Findings

01. The proposed method effectively reduces ASR latency.

02. Experimental results demonstrate the feasibility of SSR across various datasets.

03. The model improves real-time speech recognition capabilities.

Abstract

This paper explores speculative speech recognition (SSR), where we empower conventional automatic speech recognition (ASR) with speculation capabilities, allowing the recognizer to run ahead of audio. We introduce a metric for measuring SSR performance, and we propose a model that performs SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM). The ASR system transcribes ongoing audio and feeds the resulting transcripts, along with an audio-dependent prefix, to the LM, which speculates likely completions for the transcriptions. We experiment with a variety of ASR datasets, on which we show the efficacy of our method and the feasibility of SSR as a method of reducing ASR latency.
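The data flow described in the abstract can be illustrated with a toy sketch. It only mirrors the control flow (partial transcript plus an audio-dependent prefix go into a completion function); it is not the authors' implementation. Every name here (`stream_partials`, `toy_speculator`, `run_ssr`, the phrase table, and the one-number "prefix") is a hypothetical stand-in, and the toy speculator accepts the audio prefix but does not use it, whereas a real audio-prefixed LM would condition on it.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical types for this sketch; none of these names come from the paper.
AudioPrefix = List[float]
Speculator = Callable[[AudioPrefix, List[str]], List[str]]


def stream_partials(words: Sequence[str]) -> Iterator[Tuple[List[str], AudioPrefix]]:
    """Stand-in for a streaming ASR system: after each word, yield the
    partial transcript so far and a dummy audio-dependent prefix."""
    for t in range(1, len(words) + 1):
        partial = list(words[:t])
        prefix = [float(t)]  # placeholder for audio-derived embeddings
        yield partial, prefix


def toy_speculator(audio_prefix: AudioPrefix, partial: List[str]) -> List[str]:
    """Stand-in for the audio-prefixed LM: complete the partial transcript
    from a tiny phrase table. This toy ignores audio_prefix (assumption)."""
    phrases = [
        ["turn", "on", "the", "kitchen", "lights"],
        ["turn", "off", "the", "alarm"],
    ]
    for phrase in phrases:
        if phrase[: len(partial)] == partial:
            return phrase[len(partial):]
    return []  # no speculation available for this partial transcript


def run_ssr(words: Sequence[str], speculate: Speculator) -> List[Tuple[List[str], List[str]]]:
    """Return (partial transcript, speculated completion) at each step."""
    return [(partial, speculate(prefix, partial))
            for partial, prefix in stream_partials(words)]


if __name__ == "__main__":
    for partial, completion in run_ssr(["turn", "off", "the", "alarm"], toy_speculator):
        print(" ".join(partial), "->", " ".join(completion) or "(nothing)")
```

In this toy run the first speculation is wrong (after "turn" it guesses the kitchen-lights phrase), while from the second word onward the partial transcript plus the completion already equals the full utterance. That is the sense in which a speculating recognizer can "run ahead of audio"; how the paper scores such behaviour is defined by its own metric, which this sketch does not reproduce.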

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing