A Deep Learning System for Domain-specific Speech Recognition

Yanan Jia

arXiv:2303.10510 · cs.CL · September 28, 2023 · 1 citation

Open Access

TL;DR

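This paper develops a domain-specific speech recognition system by fine-tuning pre-trained deep learning acoustic models on data collected through semi-supervised annotation. The resulting system outperforms commercial ASR systems on benefit-specific speech, and the paper also examines whether error-prone ASR transcriptions remain usable for spoken language understanding.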

Contribution

It introduces a semi-supervised annotation method and demonstrates that fine-tuned Wav2Vec2 models can surpass commercial ASR systems in domain-specific applications.

Findings

01

Fine-tuned Wav2Vec2-Large-LV60 outperforms Google and AWS ASR systems on benefit-specific speech.

02

Domain-specific ASR transcriptions, despite higher WER, can be effectively used in spoken language understanding.

03

Fine-tuned ASR results are comparable to human transcriptions in NLU tasks.
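
Finding 02 refers to word error rate (WER), the standard ASR metric: the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition and is not taken from the paper. The function name `word_error_rate` and the example sentences are hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are split on whitespace and compared exactly;
    casing and punctuation are not normalized.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five reference words.
wer = word_error_rate(
    "my dental plan covers cleanings",
    "my dental plan covers cleaning",
)
print(f"WER = {wer:.2f}")
```

Because WER counts every word error equally, a transcript can have a relatively high WER and still preserve the words a downstream understanding model depends on, which is the situation Finding 02 describes.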

Abstract

As human-machine voice interfaces provide easy access to increasingly intelligent machines, many state-of-the-art automatic speech recognition (ASR) systems have been proposed. However, commercial ASR systems usually perform poorly on domain-specific speech, especially in low-resource settings. The author works with pre-trained DeepSpeech2 and Wav2Vec2 acoustic models to develop benefit-specific ASR systems. The domain-specific data are collected using the proposed semi-supervised learning annotation, which requires little human intervention. The best performance comes from a fine-tuned Wav2Vec2-Large-LV60 acoustic model with an external KenLM language model, which surpasses the Google and AWS ASR systems on benefit-specific speech. The viability of using error-prone ASR transcriptions as part of spoken language understanding (SLU) is also investigated. Results of a benefit-specific natural language understanding…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques