Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-trained DNN-HMM-Based Acoustic-Phonetic Model

Nick J.C. Wang; Lu Wang; Yandan Sun; Haimei Kang; Dejun Zhang

arXiv:2204.03315 · cs.CL · April 8, 2022

TL;DR

This paper introduces a three-module, streaming end-to-end spoken language understanding model that leverages a pre-trained DNN-HMM acoustic-phonetic system and multi-target learning to significantly improve intent classification accuracy.

Contribution

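The paper proposes a three-module streaming SLU model that reuses an open-source acoustic-phonetic module as its phoneme module and applies multi-target learning to the word and intent modules, cutting the intent-classification error rate by 40% relative (from 1.0% to 0.6%).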

Findings

1. 40% relative reduction in intent-classification error rates

2. 99.4% intent accuracy on FluentSpeech dataset

3. 50% intent error rate reduction compared to prior work

Abstract

In spoken language understanding (SLU), what the user says is converted to his/her intent. Recent work on end-to-end SLU has shown that accuracy can be improved via pre-training approaches. We revisit ideas presented by Lugosch et al. using speech pre-training and three-module modeling; however, to ease construction of the end-to-end SLU model, we use as our phoneme module an open-source acoustic-phonetic model from a DNN-HMM hybrid automatic speech recognition (ASR) system instead of training one from scratch. Hence we fine-tune on speech only for the word module, and we apply multi-target learning (MTL) on the word and intent modules to jointly optimize SLU performance. MTL yields a relative reduction of 40% in intent-classification error rates (from 1.0% to 0.6%). Note that our three-module model is a streaming method. The final outcome of the proposed three-module modeling approach…
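The two quantitative ideas in the abstract, the relative error reduction and a joint objective over the word and intent targets, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the exact loss formulation, so the weighted sum of two cross-entropy terms and the `word_weight` parameter below are hypothetical, as are all function names.

```python
import math


def relative_error_reduction(baseline_err, new_err):
    """Fraction by which the error rate drops relative to the baseline."""
    return (baseline_err - new_err) / baseline_err


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])


def multi_target_loss(word_probs, word_target,
                      intent_probs, intent_target,
                      word_weight=0.5):
    """Hypothetical multi-target objective: a weighted sum of the
    word-level and intent-level cross-entropy losses.
    The weighting scheme is an assumption, not taken from the paper."""
    word_loss = cross_entropy(word_probs, word_target)
    intent_loss = cross_entropy(intent_probs, intent_target)
    return word_weight * word_loss + (1.0 - word_weight) * intent_loss


# Error rates quoted in the abstract: 1.0% before MTL, 0.6% after.
reduction = relative_error_reduction(1.0, 0.6)

# Toy posteriors over two word classes and two intent classes.
loss = multi_target_loss([0.5, 0.5], 0, [0.25, 0.75], 1)
```

With the abstract's figures, `reduction` comes out at roughly 0.4, matching the reported 40% relative reduction. In a joint objective of this kind, both targets contribute gradients to shared parameters, which is the general motivation for optimizing the word and intent modules together.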

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems