Discriminative Speech Recognition Rescoring with Pre-trained Language Models
Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

TL;DR
This paper explores discriminative fine-tuning of pre-trained language models for second-pass speech recognition rescoring, reporting WER gains of up to 8.5% on LibriSpeech and comparing pooling-based architectures and causal versus bidirectional LMs.
Contribution
It introduces novel discriminative training schemes and pooling strategies for pre-trained LMs, demonstrating their effectiveness in speech recognition rescoring.
Findings
MWER training improves WER by up to 8.5%
Pooling variants reduce latency while maintaining gains
Bidirectional LMs outperform causal models in discriminative settings
Abstract
Second-pass rescoring is a critical component of competitive automatic speech recognition (ASR) systems. Large language models have demonstrated their ability to use pre-trained information for better rescoring of ASR hypotheses. Discriminative training, which directly optimizes the minimum word-error-rate (MWER) criterion, typically improves rescoring. In this study, we propose and explore several discriminative fine-tuning schemes for pre-trained LMs. We propose two architectures based on different pooling strategies of output embeddings and compare them with probability-based MWER. We conduct detailed comparisons between pre-trained causal and bidirectional LMs in discriminative settings. Experiments on LibriSpeech demonstrate that all MWER training schemes are beneficial, giving additional gains of up to 8.5% WER. Proposed pooling variants achieve lower latency while retaining most…
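The MWER criterion named in the abstract is commonly written as the expected number of word errors over an n-best list, with the list's mean error subtracted as a baseline. The sketch below illustrates that common formulation only; it is not taken from the paper, and the function name `mwer_loss` and its inputs (one rescoring score and one word-error count per hypothesis) are assumptions made for illustration.

```python
import math


def mwer_loss(scores, word_errors):
    """Expected relative word errors over an n-best list (illustrative sketch).

    scores:      one rescoring score per hypothesis (higher means more likely);
                 hypothetical input, e.g. a combined first-pass and LM score.
    word_errors: number of word errors of each hypothesis against the reference.
    """
    # Softmax over the n-best scores, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    posteriors = [e / total for e in exps]

    # Subtract the mean error so that probability mass on hypotheses with
    # fewer errors than average lowers the loss.
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(posteriors, word_errors))


if __name__ == "__main__":
    # Two hypotheses with 1 and 3 word errors; the first is scored higher.
    print(mwer_loss([5.0, 0.0], [1, 3]))
```

Under this formulation the loss is zero when all hypotheses are scored equally, negative when the lower-error hypothesis receives more probability mass, and positive otherwise, so minimizing it pushes the rescorer toward the hypotheses with fewer word errors.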
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Discriminative Fine-Tuning
