Customizing Speech Recognition Model with Large Language Model Feedback
Shaoshi Ling, Guoli Ye

TL;DR
This paper introduces a reinforcement learning approach that uses large language model feedback to improve speech recognition accuracy, especially for rare entities and domain mismatches, by fine-tuning the ASR model with unlabeled data.
Contribution
It presents a novel unsupervised domain adaptation method that leverages LLM-based feedback as a reward signal to enhance ASR performance on specific challenges.
Findings
Achieved a 21% reduction in entity word error rate.
Demonstrated effectiveness of LLM feedback in unsupervised fine-tuning.
Outperformed conventional self-training methods.
Abstract
Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement-learning-based approach for unsupervised domain adaptation that leverages unlabeled data to enhance transcription quality, particularly for the named entities affected by domain mismatch, through feedback from an LLM. Given contextual information, our framework employs an LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21% improvement on entity word error rate over…
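The reward loop described in the abstract (the ASR model proposes hypotheses, an LLM scores them given context, and the scores drive fine-tuning) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract excerpt does not state which reinforcement learning objective the authors use, so this sketch uses a generic expected-reward loss with a mean baseline over an N-best list, and `toy_reward` is a hypothetical stand-in for the LLM reward model, not the paper's scoring prompt.

```python
import math
from typing import Callable, List, Sequence


def softmax(log_probs: Sequence[float]) -> List[float]:
    """Normalize N-best log-probabilities into a distribution over hypotheses."""
    top = max(log_probs)
    exps = [math.exp(lp - top) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(hypothesis: str, context: str) -> float:
    """Hypothetical stand-in for the LLM reward model (assumption).

    Counts hypothesis words that also appear in the context string.
    A real system would instead ask an LLM to score the hypothesis.
    """
    context_terms = set(context.lower().split())
    return float(sum(1 for w in hypothesis.lower().split() if w in context_terms))


def expected_reward_loss(
    hypotheses: Sequence[str],
    log_probs: Sequence[float],
    context: str,
    reward_fn: Callable[[str, str], float] = toy_reward,
) -> float:
    """Negative expected reward over the N-best list, with a mean baseline.

    Minimizing this value shifts probability mass toward hypotheses that the
    reward function scores above the N-best average.
    """
    probs = softmax(log_probs)
    rewards = [reward_fn(h, context) for h in hypotheses]
    baseline = sum(rewards) / len(rewards)
    return -sum(p * (r - baseline) for p, r in zip(probs, rewards))


# Illustrative inputs (invented for this sketch, not taken from the paper).
hypotheses = ["call doctor nguyen", "call doctor win"]
context = "contact list nguyen"

# The loss is lower when the ASR model already favors the entity-correct hypothesis.
loss_favoring_entity = expected_reward_loss(hypotheses, [-0.5, -2.0], context)
loss_favoring_error = expected_reward_loss(hypotheses, [-2.0, -0.5], context)
```

In a trainable setting the log-probabilities would come from the ASR model and the loss would be differentiated with respect to its parameters. No transcripts appear anywhere in the loss, which is what makes the adaptation unsupervised.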
Taxonomy
Topics: Speech Recognition and Synthesis
