Comparing Discrete and Continuous Space LLMs for Speech Recognition

Yaoxun Xu; Shi-Xiong Zhang; Jianwei Yu; Zhiyong Wu; Dong Yu

arXiv:2409.00800·cs.CL·September 4, 2024

Open Access

TL;DR

This paper provides the first comprehensive comparison of discrete and continuous speech representations in LLM-based ASR, analyzing various training methods and model architectures to improve speech recognition accuracy.

Contribution

It introduces a detailed classification of speech representations and models, and presents an open-source approach achieving state-of-the-art WER on LibriSpeech.

Findings

1. Achieved a WER of 1.69% on LibriSpeech with a HuBERT encoder
2. Provided a comparative analysis of discrete vs. continuous speech representations
3. Presented the first extensive study of speech representations in LLM-based ASR

Abstract

This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. We further classify LLMs based on their input and autoregressive feedback into continuous and discrete-space models. Using specialized encoders and comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, we provide a detailed examination of their effectiveness. Our work marks the first extensive comparison of speech representations in LLM-based ASR and explores various modeling techniques. We present an open-sourced achievement of a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights…
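
For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and real evaluations typically apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization (casing, punctuation), which real scoring usually does.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis = i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

On this definition, a WER of 1.69% corresponds to fewer than two word errors per hundred reference words.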

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing