Decoder-only Architecture for Streaming End-to-end Speech Recognition

Emiru Tsunoo; Hayato Futami; Yosuke Kashiwagi; Siddhant Arora; Shinji Watanabe

arXiv:2406.16107 · eess.AS · August 2, 2024

TL;DR

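This paper introduces a decoder-only architecture for blockwise streaming end-to-end speech recognition. Compressed speech features are fed to the decoder block by block as prompts, and training with random-length prefix prompts makes the model robust to the truncated prompts that blockwise processing produces, improving both accuracy and speed over the baseline.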

Contribution

It proposes a decoder-only model that receives blockwise-compressed speech features as prompts, together with a random-length prefix prompt training scheme for robust streaming ASR, achieving better accuracy and faster inference than the baseline.

Findings

01. 8% relative WER reduction on LibriSpeech test-other

02. Twice as fast as baseline models

03. Effective robustness to truncated prompts
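To make the first finding concrete, here is a minimal sketch of how a relative WER reduction is computed. The function name and the numbers are illustrative assumptions, not values reported in the paper; the paper's absolute WERs are not given on this page.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical WERs chosen only to illustrate the arithmetic
# (NOT results from the paper).
example = relative_wer_reduction(baseline_wer=10.0, proposed_wer=9.2)
print(f"{example:.1%}")
```

A relative reduction is measured against the baseline, so the same 8% corresponds to a smaller absolute gain when the baseline WER is already low.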

Abstract

Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for streaming applications of ASR. In this work, we propose to use a decoder-only architecture for blockwise streaming ASR. In our approach, speech features are compressed using CTC output and context embedding using blockwise speech subnetwork, and are sequentially provided as prompts to the decoder. The decoder estimates the output tokens promptly at each block. To this end, we also propose a novel training scheme using random-length prefix prompts to make the model robust to the truncated prompts caused by blockwise processing. An experimental comparison shows that our proposed decoder-only streaming ASR achieves 8% relative word error rate…
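The pipeline in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes that CTC-based compression means dropping frames whose greedy CTC label is blank (the abstract does not spell out the rule), it omits the context embedding, and all names (`ctc_compress`, `blockwise_prompts`, `random_prefix`, `BLANK`) are hypothetical.

```python
import random
from typing import Iterator, List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_compress(frames: Sequence, ctc_posteriors: Sequence[Sequence[float]],
                 blank: int = BLANK) -> List:
    """Keep only frames whose greedy (argmax) CTC label is not blank.

    Assumption: this is one plausible compression rule, not confirmed
    by the abstract.
    """
    kept = []
    for frame, post in zip(frames, ctc_posteriors):
        best = max(range(len(post)), key=post.__getitem__)
        if best != blank:
            kept.append(frame)
    return kept


def blockwise_prompts(frames: Sequence, ctc_posteriors: Sequence[Sequence[float]],
                      block_size: int) -> Iterator[List]:
    """Yield the compressed prompt accumulated after each input block."""
    prompt: List = []
    for start in range(0, len(frames), block_size):
        end = start + block_size
        prompt = prompt + ctc_compress(frames[start:end], ctc_posteriors[start:end])
        yield list(prompt)


def random_prefix(prompt: Sequence, rng: random.Random) -> List:
    """Truncate a full prompt to a random-length prefix (training-time only)."""
    if not prompt:
        return []
    length = rng.randint(1, len(prompt))  # inclusive on both ends
    return list(prompt[:length])


# Toy data: strings stand in for feature vectors; two-class posteriors
# where index 0 is blank.
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4],
              [0.1, 0.9], [0.3, 0.7], [0.8, 0.2]]

prompts = list(blockwise_prompts(frames, posteriors, block_size=2))
truncated = random_prefix(prompts[-1], random.Random(0))
```

At inference time the decoder would see each element of `prompts` in turn and emit tokens for that block. During training, `random_prefix` mimics the truncated prompts that blockwise processing produces, which is the robustness the training scheme targets.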

Taxonomy

Topics: Advanced Data Compression Techniques · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sparse Evolutionary Training