DNN-Based Semantic Model for Rescoring N-best Speech Recognition List
Dominique Fohr, Irina Illina

TL;DR
This paper introduces a DNN-based semantic rescoring method for N-best speech recognition hypotheses, utilizing word embeddings and acoustic features to reduce word error rate under noisy conditions.
Contribution
It proposes a novel DNN model that incorporates semantic and acoustic features for rescoring, improving speech recognition accuracy in noisy environments.
Findings
Significant WER reduction in noisy conditions
Effective use of word2vec and BERT embeddings
Improved performance over baseline models
Abstract
The word error rate (WER) of an automatic speech recognition (ASR) system increases when the training and testing conditions are mismatched, for example because of noise. In this case, the acoustic information can be less reliable. This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features. We propose to do this by rescoring the ASR N-best hypotheses list with a deep neural network (DNN) trained for this purpose. Our DNN rescoring model aims to select hypotheses that have better semantic consistency and therefore lower WER. We investigate two types of representations as part of the input features to our DNN model: static word embeddings (from word2vec) and dynamic contextual embeddings (from BERT). Acoustic and linguistic features are also included. We perform experiments on the publicly available dataset…
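The rescoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the DNN is replaced here by a simple stand-in (mean cosine similarity of each word vector to the hypothesis centroid), and the names Hypothesis, semantic_score, rescore, and the weights alpha and beta are hypothetical, chosen only for illustration. It shows how a semantic score can be combined with acoustic and linguistic scores to pick one hypothesis from an N-best list.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Sequence


@dataclass
class Hypothesis:
    words: List[str]
    acoustic_score: float  # assumed: log-likelihood from the ASR decoder
    lm_score: float        # assumed: language-model log-probability


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_score(hyp: Hypothesis, embeddings: Dict[str, List[float]]) -> float:
    """Stand-in for the DNN (an assumption, not the paper's model):
    mean cosine similarity between each word vector and the centroid."""
    vecs = [embeddings[w] for w in hyp.words if w in embeddings]
    if not vecs:
        return 0.0
    dim = len(vecs[0])
    centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return sum(cosine(v, centroid) for v in vecs) / len(vecs)


def rescore(
    nbest: List[Hypothesis],
    embeddings: Dict[str, List[float]],
    alpha: float = 1.0,  # hypothetical weight on the language-model score
    beta: float = 2.0,   # hypothetical weight on the semantic score
) -> Hypothesis:
    """Return the hypothesis with the highest combined score."""
    def combined(h: Hypothesis) -> float:
        return (
            h.acoustic_score
            + alpha * h.lm_score
            + beta * semantic_score(h, embeddings)
        )
    return max(nbest, key=combined)
```

With toy two-dimensional embeddings in which "ship", "sailed", and "sea" point in similar directions and "see" does not, the hypothesis ending in "sea" can overtake an acoustically slightly better hypothesis ending in "see" once beta is large enough. Setting beta to 0 falls back to the acoustic and language-model scores alone.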
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
