A Deep Learning System for Domain-specific Speech Recognition
Yanan Jia

TL;DR
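This paper develops a domain-specific speech recognition system using pre-trained deep learning acoustic models, semi-supervised data collection, and fine-tuning. The fine-tuned system outperforms commercial ASR systems on benefit-specific speech, and its transcriptions remain usable for downstream language understanding despite recognition errors.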
Contribution
It introduces a semi-supervised annotation method and demonstrates that fine-tuned Wav2Vec2 models can surpass commercial ASR systems in domain-specific applications.
Findings
Fine-tuned Wav2Vec2-Large-LV60 outperforms Google and AWS ASR systems on benefit-specific speech.
Domain-specific ASR transcriptions, despite being error-prone (non-trivial word error rate, WER), can be used effectively in spoken language understanding (SLU).
Fine-tuned ASR results are comparable to human transcriptions in NLU tasks.
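The findings above are stated in terms of word error rate (WER). As a rough illustration of how that metric is conventionally computed, the sketch below divides the word-level edit distance (substitutions, deletions, insertions) between a reference and an ASR hypothesis by the number of reference words. It is not the paper's evaluation code: the function name `word_error_rate` and the example sentences are hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokens are split on whitespace and compared
    exactly (case-sensitive, punctuation kept).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical benefit-related utterance: one substituted word out of five.
print(word_error_rate("my dental plan covers cleanings",
                      "my dental plan covers cleaning"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.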
Abstract
As human-machine voice interfaces provide easy access to increasingly intelligent machines, many state-of-the-art automatic speech recognition (ASR) systems have been proposed. However, commercial ASR systems usually perform poorly on domain-specific speech, especially under low-resource settings. The author works with pre-trained DeepSpeech2 and Wav2Vec2 acoustic models to develop benefit-specific ASR systems. The domain-specific data are collected using a proposed semi-supervised learning annotation method that requires little human intervention. The best performance comes from a fine-tuned Wav2Vec2-Large-LV60 acoustic model with an external KenLM, which surpasses the Google and AWS ASR systems on benefit-specific speech. The viability of using error-prone ASR transcriptions as part of spoken language understanding (SLU) is also investigated. Results of a benefit-specific natural language understanding…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
