Deliberation Model for On-Device Spoken Language Understanding
Duc Le, Akshat Shrivastava, Paden Tomasello, Suyoun Kim, Aleksandr Livshits, Ozlem Kalinli, Michael L. Seltzer

TL;DR
This paper introduces a deliberation-based end-to-end spoken language understanding system in which ASR and NLU share parameters, improving accuracy and robustness to ASR errors while remaining suitable for on-device deployment.
Contribution
It presents a novel deliberation model that integrates ASR and NLU with shared parameters, supports complex compositional semantic structures, and is suited to resource-constrained (on-device) environments.
Findings
Outperforms strong pipeline NLU baselines by 0.60% to 0.65% on STOP, the spoken version of the TOPv2 dataset
Fusion of text and audio features enhances robustness to ASR errors
Reduces performance degradation when using synthetic speech for training
Abstract
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU), where a streaming automatic speech recognition (ASR) model produces the first-pass hypothesis and a second-pass natural language understanding (NLU) component generates the semantic parse by conditioning on both ASR's text and audio embeddings. By formulating E2E SLU as a generalized decoder, our system is able to support complex compositional semantic structures. Furthermore, the sharing of parameters between ASR and NLU makes the system especially suitable for resource-constrained (on-device) environments; our proposed approach consistently outperforms strong pipeline NLU baselines by 0.60% to 0.65% on the spoken version of the TOPv2 dataset (STOP). We demonstrate that the fusion of text and audio features, coupled with the system's ability to rewrite the first-pass hypothesis,…
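The two-pass conditioning described above can be illustrated with a toy sketch. The abstract does not specify how text and audio embeddings are fused, so everything below is an assumption made for illustration: the function names (`attend`, `fuse_contexts`) are hypothetical, and the choice of scaled dot-product attention over each source followed by concatenation is one plausible reading, not the authors' confirmed design.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, memory: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over a list of memory vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, m) / scale for m in memory])
    dim = len(memory[0])
    return [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]


def fuse_contexts(
    decoder_state: Vector,
    text_embeddings: List[Vector],
    audio_embeddings: List[Vector],
) -> Vector:
    """Hypothetical fusion step: attend separately over the first-pass
    text embeddings and the audio embeddings, then concatenate the two
    context vectors for the second-pass (NLU) decoder."""
    text_ctx = attend(decoder_state, text_embeddings)
    audio_ctx = attend(decoder_state, audio_embeddings)
    return text_ctx + audio_ctx


# Toy inputs: two first-pass token embeddings and three audio frames.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
audio_emb = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
state = [1.0, 0.0]

fused = fuse_contexts(state, text_emb, audio_emb)
```

Because the second-pass decoder sees an audio context alongside the text context in this sketch, it has a signal that does not pass through the first-pass transcript, which is the intuition behind the robustness to ASR errors that the findings report.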
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
