RLBR: Reinforcement Learning with Biasing Rewards for Contextual Speech Large Language Models
Bo Ren, Ruchao Fan, Yelong Shen, Weizhu Chen, Jinyu Li

TL;DR
This paper introduces RLBR, a reinforcement learning method that improves speech large language models' recognition of rare and domain-specific words by biasing rewards and reference-aware mechanisms, leading to significant performance gains.
Contribution
The paper proposes RLBR, a novel fine-tuning approach that explicitly emphasizes biasing words using specialized rewards and reference-aware reinforcement learning, enhancing recognition accuracy.
Findings
RLBR significantly reduces Biasing Word Error Rates on LibriSpeech.
The method outperforms recent approaches across various biasing list sizes.
RLBR maintains overall WER while improving biasing word recognition.
Abstract
Speech large language models (LLMs) have driven significant progress in end-to-end speech understanding and recognition, yet they continue to struggle with accurately recognizing rare words and domain-specific terminology. This paper presents a novel fine-tuning method, Reinforcement Learning with Biasing Rewards (RLBR), which employs a specialized biasing words preferred reward to explicitly emphasize biasing words in the reward calculation. In addition, we introduce reference-aware mechanisms that extend the reinforcement learning algorithm with reference transcription to strengthen the potential trajectory exploration space. Experiments on the LibriSpeech corpus across various biasing list sizes demonstrate that RLBR delivers substantial performance improvements over a strong supervised fine-tuning (SFT) baseline and consistently outperforms several recently published methods. The…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
