Discriminative Speech Recognition Rescoring with Pre-trained Language Models

Prashanth Gurunath Shivakumar; Jari Kolehmainen; Yile Gu; Ankur Gandhe; Ariya Rastrow; Ivan Bulyko

arXiv:2310.06248·eess.AS·October 11, 2023·ASRU

Open Access

TL;DR

This paper explores discriminative fine-tuning of pre-trained language models for second pass speech recognition rescoring, showing significant WER improvements and analyzing different architectures and bidirectionality effects.

Contribution

It introduces novel discriminative training schemes and pooling strategies for pre-trained LMs, demonstrating their effectiveness in speech recognition rescoring.

Findings

01. MWER training improves WER by up to 8.5%

02. Pooling variants reduce latency while maintaining gains

03. Bidirectional LMs outperform causal models in discriminative settings

Abstract

Second-pass rescoring is a critical component of competitive automatic speech recognition (ASR) systems. Large language models have demonstrated their ability to use pre-trained information for better rescoring of ASR hypotheses. Discriminative training, which directly optimizes the minimum word-error-rate (MWER) criterion, typically improves rescoring. In this study, we propose and explore several discriminative fine-tuning schemes for pre-trained LMs. We propose two architectures based on different pooling strategies of output embeddings and compare them with probability-based MWER. We conduct detailed comparisons between pre-trained causal and bidirectional LMs in discriminative settings. Experiments on LibriSpeech demonstrate that all MWER training schemes are beneficial, giving additional gains of up to 8.5% WER. The proposed pooling variants achieve lower latency while retaining most…
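To make the MWER criterion mentioned in the abstract concrete, the following is a minimal sketch of the commonly used n-best approximation: hypothesis scores are normalized with a softmax over the n-best list, and the loss is the expected number of word errors relative to the list average. This is an illustration under stated assumptions, not the authors' implementation; the function name `mwer_loss` and its arguments are hypothetical, and how the scores are produced (for example, from a rescoring LM combined with first-pass scores) is left out.

```python
import math


def mwer_loss(scores, word_errors):
    """Expected relative word errors over one n-best list (illustrative sketch).

    scores:      one real-valued score per hypothesis (higher = more likely);
                 assumed here to be log-domain rescoring scores.
    word_errors: number of word errors of each hypothesis against the reference.
    """
    if not scores or len(scores) != len(word_errors):
        raise ValueError("scores and word_errors must be non-empty and equal length")

    # Softmax over the n-best list, shifted by the max for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    posteriors = [e / total for e in exps]

    # Subtracting the mean error count is a common variance-reduction baseline.
    mean_errors = sum(word_errors) / len(word_errors)

    return sum(p * (w - mean_errors) for p, w in zip(posteriors, word_errors))


# Usage: three hypotheses with 0, 2, and 4 word errors.
loss_uniform = mwer_loss([0.0, 0.0, 0.0], [0, 2, 4])  # uniform posteriors
loss_good = mwer_loss([5.0, 0.0, 0.0], [0, 2, 4])     # mass on the best hypothesis
loss_bad = mwer_loss([0.0, 0.0, 5.0], [0, 2, 4])      # mass on the worst hypothesis
```

With uniform scores the loss is zero; it becomes negative when probability mass shifts toward low-error hypotheses and positive when it shifts toward high-error ones, which is the direction a discriminative rescorer is trained to exploit.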

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Discriminative Fine-Tuning