Large Language Model Guided Decoding for Self-Supervised Speech Recognition

Eyal Cohen (1); Bhiksha Raj (2); Joseph Keshet (1) ((1) Technion - Israel Institute of Technology; (2) Carnegie Mellon University)

arXiv:2508.02228·eess.AS·January 7, 2026

Open Access

TL;DR

This paper introduces a decoding method for self-supervised speech recognition in which an LLM proposes candidate next tokens and an SSL acoustic model scores each candidate against the input audio, improving transcription accuracy, especially on complex and domain-specific speech.

Contribution

The paper presents a new LLM-guided decoding approach that combines LLM predictions with SSL acoustic scores, outperforming existing methods in challenging speech recognition scenarios.

Findings

1. Outperforms current state-of-the-art LLM-based decoding methods.
2. Effective on complex sentences, acronyms, and domain-specific vocabulary.
3. Improves transcription accuracy in challenging speech inputs.

Abstract

Self-supervised automatic speech recognition (SSL-ASR) is an ASR approach that uses speech encoders pretrained on large amounts of unlabeled audio (e.g., wav2vec2.0 or HuBERT) and then fine-tunes them with limited labeled data to perform transcription. Decoding is usually performed with a CTC decoder, whose hypotheses are scored and refined using an external language model (LM), typically an n-gram or neural LM, which guides beam search to produce the final transcription. Using Large Language Models (LLMs) as external LMs remains a challenge, as their word probabilities are overly confident. The proposed method integrates an LLM with an SSL acoustic model by using the LLM's decoding mechanism to generate a set of candidate next tokens. For each candidate, the SSL model provides an acoustic score by aligning it to the input acoustics of the SSL model. A combined acoustic and LLM score is…
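
The candidate-scoring step described in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it says how the acoustic and LLM scores are combined, so the log-linear interpolation below, the `weight` parameter, and the function names `combine_scores` and `pick_next_token` are all assumptions made for illustration, not the authors' formulation. The acoustic scores are passed in as plain log-probabilities rather than computed by an SSL model.

```python
import math


def combine_scores(llm_logprobs, acoustic_logprobs, weight=0.5):
    """Interpolate LLM and acoustic log-scores per candidate token.

    ASSUMPTION: log-linear interpolation with a single weight. The source
    abstract does not specify the combination rule.
    """
    return {
        token: weight * acoustic_logprobs[token] + (1.0 - weight) * llm_logprobs[token]
        for token in llm_logprobs
    }


def pick_next_token(llm_logprobs, acoustic_logprobs, weight=0.5):
    """Return the candidate with the highest combined score."""
    combined = combine_scores(llm_logprobs, acoustic_logprobs, weight)
    return max(combined, key=combined.get)


# Toy candidates: the LLM prefers "cat", the (hypothetical) acoustic
# scores prefer "cap".
llm = {"cat": math.log(0.6), "cap": math.log(0.4)}
acoustic = {"cat": math.log(0.1), "cap": math.log(0.9)}

best = pick_next_token(llm, acoustic, weight=0.5)
```

With equal weighting the acoustic evidence overrides the LLM's preference in this toy case. Setting `weight=0.0` ignores the acoustics entirely and falls back to the LLM's top candidate.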

Taxonomy

Topics: Speech Recognition and Synthesis · Machine Learning and Data Classification · Music and Audio Processing