Sequential End-to-End Intent and Slot Label Classification and Localization
Yiran Cao, Nihal Potdar, Anderson R. Avila

TL;DR
This paper introduces a streaming end-to-end spoken language understanding (SLU) model, built on a 3D-CNN and a unidirectional LSTM, that predicts intent and slot values directly from speech, achieving high accuracy and enabling localization of audio events.
Contribution
The paper presents a compact streaming SLU architecture and a novel use of the connectionist temporal localization (CTL) loss for simultaneous classification and localization, aimed at reducing response latency in dialogue systems.
Findings
Achieves up to 98.97% accuracy in single-label classification.
Demonstrates effective localization of audio events with CTL.
Outperforms traditional methods in streaming SLU scenarios.
Abstract
Human-computer interaction (HCI) is significantly impacted by delayed responses from a spoken dialogue system. Hence, end-to-end (e2e) spoken language understanding (SLU) solutions have recently been proposed to decrease latency. Such approaches allow for the extraction of semantic information directly from the speech signal, thus bypassing the need for a transcript from an automatic speech recognition (ASR) system. In this paper, we propose a compact e2e SLU architecture for streaming scenarios, where chunks of the speech signal are processed continuously to predict intent and slot values. Our model is based on a 3D convolutional neural network (3D-CNN) and a unidirectional long short-term memory (LSTM). We compare the performance of two alignment-free losses: the connectionist temporal classification (CTC) method and its adapted version, namely connectionist temporal localization…
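To make the streaming setup concrete, here is a minimal sketch of how per-frame outputs trained with a CTC-style loss could be turned into intent and slot labels one chunk at a time. It is not the authors' implementation. Greedy (best-path) decoding is a common choice assumed here for illustration, the label indices (0 = blank, 1 = an intent, 2 = a slot) are hypothetical, and the hard-coded scores stand in for what the 3D-CNN and LSTM would produce.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


class StreamingCTCGreedyDecoder:
    """Collapses per-frame label scores into labels, one chunk at a time."""

    def __init__(self, blank: int = BLANK) -> None:
        self.blank = blank
        # Best label of the last frame seen; carried across chunk boundaries
        # so a label spanning two chunks is emitted only once.
        self.prev = blank

    def feed(self, chunk_scores: Sequence[Sequence[float]]) -> List[int]:
        """Consume one chunk of frames and return the newly emitted labels."""
        emitted: List[int] = []
        for frame in chunk_scores:
            # Index of the highest-scoring label in this frame.
            best = max(range(len(frame)), key=frame.__getitem__)
            # CTC collapse rule: merge repeated labels, then drop blanks.
            if best != self.prev and best != self.blank:
                emitted.append(best)
            self.prev = best
        return emitted


# Hypothetical stream of three chunks; each frame scores [blank, intent, slot].
chunks = [
    [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
    [[0.1, 0.6, 0.3], [0.9, 0.05, 0.05], [0.1, 0.2, 0.7]],
    [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],
]

decoder = StreamingCTCGreedyDecoder()
for chunk in chunks:
    print(decoder.feed(chunk))
```

Because the decoder keeps only the last frame's label as state, each chunk can be handled as soon as it arrives, which is the property a streaming system needs. The frame index at which a label is first emitted is only a rough indication of where the event occurs; the abstract presents CTL as the loss adapted for localization.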
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Dialogue Systems
