Large Language Model Guided Decoding for Self-Supervised Speech Recognition
Eyal Cohen (1), Bhiksha Raj (2), Joseph Keshet (1) ((1) Technion - Israel Institute of Technology, (2) Carnegie Mellon University)

TL;DR
This paper introduces a decoding method for self-supervised speech recognition that integrates a large language model with an acoustic model, improving transcription accuracy, especially on complex and domain-specific speech inputs.
Contribution
The paper presents a new LLM-guided decoding approach that combines LLM predictions with SSL acoustic scores, outperforming existing methods in challenging speech recognition scenarios.
Findings
Outperforms current state-of-the-art LLM-based decoding methods.
Effective on complex sentences, acronyms, and domain-specific vocabulary.
Improves transcription accuracy on challenging speech inputs.
Abstract
Self-supervised automatic speech recognition (SSL-ASR) is an ASR approach that uses speech encoders pretrained on large amounts of unlabeled audio (e.g., wav2vec2.0 or HuBERT) and then fine-tunes them with limited labeled data to perform transcription. Decoding is usually performed with a CTC decoder, whose hypotheses are scored and refined using an external language model (LM), typically an n-gram or neural LM, which guides beam search to produce the final transcription. Using Large Language Models (LLMs) as external LMs remains a challenge, as their word probabilities are overly confident. The proposed method integrates an LLM with an SSL acoustic model by using the LLM's decoding mechanism to generate a set of candidate next tokens. For each candidate, the SSL model provides an acoustic score by aligning it to the input acoustics of the SSL model. A combined acoustic and LLM score is…
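The decoding loop described in the abstract (the LLM proposes candidate next tokens, the acoustic model scores each one, and the two scores are combined) can be illustrated with a minimal Python sketch. The abstract is truncated before it says how the scores are combined, so the log-linear interpolation and the `weight` parameter below are assumptions, not the authors' formula. The acoustic score uses the standard CTC forward algorithm over the full hypothesis as a stand-in for the paper's alignment step; a real decoder would more likely use a prefix score. The names `ctc_log_likelihood` and `pick_next_token` are hypothetical.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs, labels, blank=0):
    """Standard CTC forward algorithm: log P(labels | frames).

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    labels: list of symbol ids without blanks.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    return logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]

def pick_next_token(candidates, log_probs, prefix, weight=0.5):
    """Choose the candidate with the best combined score.

    candidates: dict mapping a tuple of symbol ids to its LLM log-probability.
    weight: assumed interpolation weight on the acoustic score (hypothetical).
    """
    best, best_score = None, NEG_INF
    for cand, llm_lp in candidates.items():
        acoustic_lp = ctc_log_likelihood(log_probs, list(prefix) + list(cand))
        # Assumed combination: log-linear interpolation of the two scores.
        score = weight * acoustic_lp + (1.0 - weight) * llm_lp
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

In this sketch a candidate that the LLM favors can still lose if the acoustics contradict it, which is the behavior the abstract motivates when it notes that LLM word probabilities are overly confident.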
Taxonomy
Topics: Speech Recognition and Synthesis · Machine Learning and Data Classification · Music and Audio Processing
