Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time

Frank Seide; Morrie Doulaty; Yangyang Shi; Yashesh Gaur; Junteng Jia; Chunyang Wu

arXiv:2406.09569·cs.CL·June 17, 2024·1 cites

Open Access

TL;DR

Speech ReaLLM is a decoder-only speech recognition architecture that handles continuous audio streams, letting multimodal LLMs transcribe speech in real time without explicit endpointing.

Contribution

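The paper presents what the authors describe as the first decoder-only ASR architecture designed to handle continuous audio without explicit end-pointing, and it introduces the more general ReaLLM ("real-time LLM") approach, of which Speech ReaLLM is a special case.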

Findings

01

An 80M Speech ReaLLM achieves WERs of 3.0% and 7.4% on Librispeech "test" in real time, without an external LM or auxiliary loss

02

Without an external LM, WER is only slightly above that of a 3x larger Attention-Encoder-Decoder baseline

03

Pre-trained 7B LLM can be fine-tuned for speech recognition tasks

Abstract

We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM ("real-time LLM") approach, also introduced here for the first time. The idea is inspired by RNN-T: Instead of generating a response only at the end of a user prompt, generate after every input token received in real time (it is often empty). On Librispeech "test", an 80M Speech ReaLLM achieves WERs of 3.0% and 7.4% in real time (without an external LM or auxiliary loss). This is only slightly above a 3x larger Attention-Encoder-Decoder baseline. We also show that this way, an LLM architecture can learn to represent and reproduce the flow of time;…
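The abstract's core idea (generate after every input received in real time, with the output often empty) can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `BLANK` marker, the `stream_decode` function, the `max_tokens_per_chunk` cap, and the scripted `toy_next_token` policy are all hypothetical names introduced here, and a real system would replace the policy with a trained decoder-only model.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical marker meaning "nothing (more) to emit for this input".
BLANK = "<blank>"

# The running context interleaves audio chunks and emitted text tokens.
Context = List[Tuple[str, object]]


def stream_decode(
    chunks: Iterable[object],
    next_token: Callable[[Context], str],
    max_tokens_per_chunk: int = 8,  # assumed safety cap, not from the paper
) -> Iterator[List[str]]:
    """Yield the (possibly empty) list of tokens emitted after each chunk."""
    context: Context = []
    for chunk in chunks:
        # 1. A new piece of input arrives in real time.
        context.append(("audio", chunk))
        emitted: List[str] = []
        # 2. Generate immediately, instead of waiting for the end of the input.
        for _ in range(max_tokens_per_chunk):
            token = next_token(context)
            if token == BLANK:
                break  # the model has nothing more to say for now
            context.append(("text", token))
            emitted.append(token)
        yield emitted


def toy_next_token(context: Context) -> str:
    """Scripted stand-in for a model: emits one word per three chunks heard."""
    words = ["hello", "world"]
    n_audio = sum(1 for kind, _ in context if kind == "audio")
    n_text = sum(1 for kind, _ in context if kind == "text")
    if n_text < len(words) and n_audio >= 3 * (n_text + 1):
        return words[n_text]
    return BLANK


if __name__ == "__main__":
    for step, tokens in enumerate(stream_decode(range(6), toy_next_token), 1):
        print(step, tokens)
```

In this toy run most steps yield an empty list, which mirrors the remark that the per-input response is often empty; text appears only once enough audio has been received.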

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems