Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

Peikun Chen; Sining Sun; Changhao Shan; Qing Yang; Lei Xie

arXiv:2406.18862 · cs.SD · June 28, 2024

TL;DR

This paper presents a streaming decoder-only speech recognition model built on discrete speech units. It combines a dedicated boundary token, causal attention masking, and right-chunk attention, and reports results on the AISHELL-1 and AISHELL-2 datasets that are competitive with non-streaming models.

Contribution

The study introduces a streaming-capable decoder-only ASR model with boundary tokens and right-chunk attention, tailored for real-time speech recognition tasks.

Findings

01. Achieves performance competitive with non-streaming models on the AISHELL datasets.

02. Employs a boundary token and right-chunk attention to support streaming recognition.

03. Uses data augmentation techniques to improve contextual modeling.

Abstract

Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2…
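The abstract names two attention patterns, causal masking and right-chunk attention, without giving their exact form. The following is a minimal sketch of how such masks might be built, not the authors' implementation. The function names, the `chunk_size` and `right_chunks` parameters, and the rule that a position may see up to the end of the chunk(s) to the right of its own are all assumptions made for illustration.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j.

    Strictly causal: each position sees itself and everything before it.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def right_chunk_mask(n, chunk_size, right_chunks=1):
    """Causal mask widened with a bounded amount of future context.

    Assumed rule (hypothetical, for illustration only): positions are grouped
    into chunks of `chunk_size`, and a position in chunk c may attend to every
    position up to the end of chunk c + right_chunks.
    """
    mask = []
    for i in range(n):
        # Index of the last key position visible to query position i.
        limit = (i // chunk_size + right_chunks + 1) * chunk_size - 1
        mask.append([j <= limit for j in range(n)])
    return mask


if __name__ == "__main__":
    # Print both masks for a toy sequence of 6 positions, chunks of 2.
    for name, mask in (("causal", causal_mask(6)),
                       ("right-chunk", right_chunk_mask(6, 2))):
        print(name)
        for row in mask:
            print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch the lookahead is bounded by `right_chunks * chunk_size` positions past the end of the current chunk, which is what keeps the pattern compatible with streaming: decoding a chunk only has to wait for a fixed amount of future input rather than the whole utterance.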

Code & Models

Repositories

chenpk00/IS2024_stream_decoder_only_asr (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis