Customizing Speech Recognition Model with Large Language Model Feedback

Shaoshi Ling; Guoli Ye

arXiv:2506.11091 · cs.CL · August 21, 2025

TL;DR

This paper introduces a reinforcement learning approach that uses large language model feedback to improve speech recognition accuracy, especially for rare entities and domain mismatches, by fine-tuning the ASR model with unlabeled data.

Contribution

It presents an unsupervised domain adaptation method that uses LLM feedback as a reward signal to improve ASR transcription of rare named entities and of speech from mismatched domains.

Findings

01. Achieved a 21% reduction in entity word error rate.

02. Demonstrated the effectiveness of LLM feedback in unsupervised fine-tuning.

03. Outperformed conventional self-training methods.

Abstract

Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement-learning-based approach for unsupervised domain adaptation that leverages unlabeled data to enhance transcription quality, particularly for named entities affected by domain mismatch, through feedback from an LLM. Given contextual information, our framework employs an LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21% improvement on entity word error rate over…
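The reward loop described in the abstract can be illustrated with a small sketch. The page does not state which reinforcement learning objective the paper uses, so this example assumes an expected-reward loss over an N-best list with a mean-reward baseline (similar in spirit to minimum-risk training). The names `toy_reward` and `loss_and_grad` are hypothetical, and `toy_reward` is only a keyword-overlap stand-in for the LLM reward model, not the authors' scorer.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def toy_reward(context: str, hypothesis: str) -> float:
    """Hypothetical stand-in for the LLM reward model (assumption).

    Returns the fraction of context terms that appear in the hypothesis.
    A real system would prompt an LLM with the context and the hypothesis.
    """
    terms = context.lower().split()
    if not terms:
        return 0.0
    words = set(hypothesis.lower().split())
    return sum(t in words for t in terms) / len(terms)


def loss_and_grad(scores: List[float], rewards: List[float]) -> Tuple[float, List[float]]:
    """Expected-reward loss over an N-best list, with its gradient.

    scores:  ASR log-scores of the hypotheses (renormalized by softmax).
    rewards: one reward per hypothesis.
    The loss is -sum_i p_i * (r_i - mean(r)); the gradient is taken
    with respect to the scores.
    """
    probs = softmax(scores)
    baseline = sum(rewards) / len(rewards)
    adv = [r - baseline for r in rewards]
    expected = sum(p * a for p, a in zip(probs, adv))
    grad = [-p * (a - expected) for p, a in zip(probs, adv)]
    return -expected, grad


# Usage: the ASR model initially prefers the misrecognized entity.
context = "kubernetes"
hyps = ["deploy it on cooper netties", "deploy it on kubernetes"]
scores = [-1.0, -2.0]
rewards = [toy_reward(context, h) for h in hyps]
loss, grad = loss_and_grad(scores, rewards)

# One gradient-descent step on the scores (illustrative learning rate).
lr = 1.0
new_scores = [s - lr * g for s, g in zip(scores, grad)]
old_probs, new_probs = softmax(scores), softmax(new_scores)
```

In this sketch the update shifts probability mass toward the hypothesis that the reward function prefers. In an actual ASR model the gradient would flow into the network parameters rather than into the scores directly.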

Taxonomy

Topics: Speech Recognition and Synthesis