Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding
Xuandi Fu, Feng-Ju Chang, Martin Radfar, Kai Wei, Jing Liu, Grant P. Strimel, Kanthashree Mysore Sathyendra

TL;DR
This paper introduces a streamable multi-task RNN-T model with a semantic decoder that jointly predicts speech transcripts and semantic labels, improving accuracy and reducing latency in spoken language understanding systems.
Contribution
The paper proposes an end-to-end multi-task RNN-T architecture whose semantic decoder conditions on previous predictions, which makes the NLU component streamable and improves performance over traditional two-stage models.
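The idea of emitting word-pieces and semantic labels frame by frame can be illustrated with a toy sketch. This is not the authors' implementation: the sizes, the additive joint network, the state-update rule, the randomly initialised weights, and all names (`joint`, `stream_decode`, `emb_wp`, `emb_sem`) are assumptions made for illustration, and the decoder is simplified to emit at most one word-piece and one label per encoder frame.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8        # hidden size (hypothetical)
V_WP = 6     # word-piece vocabulary size, index 0 = blank (hypothetical)
V_SEM = 4    # semantic label vocabulary size, index 0 = blank (hypothetical)

# Randomly initialised stand-ins for trained parameters (assumption).
emb_wp = rng.normal(size=(V_WP, D))
emb_sem = rng.normal(size=(V_SEM, D))
W_wp = rng.normal(size=(D, V_WP))
W_sem = rng.normal(size=(D, V_SEM))


def joint(enc_t, dec_state, W):
    """Additive joint: combine one encoder frame with a decoder state into logits."""
    return np.tanh(enc_t + dec_state) @ W


def stream_decode(encoder_frames):
    """Greedy streaming decode, at most one word-piece and one label per frame."""
    wp_state = np.zeros(D)   # summarises previously emitted word-pieces
    sem_state = np.zeros(D)  # summarises previously emitted word-pieces and labels
    word_pieces, labels = [], []
    for enc_t in encoder_frames:
        # ASR branch: predict the next word-piece from the frame and the history.
        wp = int(np.argmax(joint(enc_t, wp_state, W_wp)))
        if wp != 0:  # non-blank: record it and update both histories
            word_pieces.append(wp)
            wp_state = np.tanh(wp_state + emb_wp[wp])
            sem_state = np.tanh(sem_state + emb_wp[wp])
        # Semantic branch: predict a label without waiting for the end of audio.
        sem = int(np.argmax(joint(enc_t, sem_state, W_sem)))
        if sem != 0:
            labels.append(sem)
            sem_state = np.tanh(sem_state + emb_sem[sem])
    return word_pieces, labels


frames = rng.normal(size=(5, D))  # stand-in for five encoder output frames
wps, tags = stream_decode(frames)
print(wps, tags)
```

In this sketch the semantic branch runs inside the same per-frame loop as the ASR branch, which is the property that distinguishes a streamable design from a two-stage one where NLU starts only after the full transcript is available.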
Findings
Outperforms two-stage E2E SLU models on industry and public datasets.
Achieves better ASR and NLU metrics with lower latency.
Demonstrates effective joint optimization of speech recognition and semantic understanding.
Abstract
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency when compared to traditionally cascaded pipelines. Existing E2E SLU models usually follow a two-stage configuration where an Automatic Speech Recognition (ASR) network first predicts a transcript which is then passed to a Natural Language Understanding (NLU) module through an interface to infer semantic labels, such as intent and slot tags. This design, however, does not consider the NLU posterior while making transcript predictions, nor correct the NLU prediction error immediately by considering the previously predicted word-pieces. In addition, the NLU model in the two-stage system is not streamable, as it must wait for the audio segments to complete processing, which ultimately impacts the latency of the SLU system. In this work, we propose…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling
