Sequential End-to-End Intent and Slot Label Classification and Localization

Yiran Cao; Nihal Potdar; Anderson R. Avila

arXiv:2106.04660·cs.CL·June 10, 2021

Open Access

TL;DR

This paper introduces a streaming end-to-end spoken language understanding (SLU) model, built on a 3D-CNN and an LSTM, that predicts intent and slot values directly from speech, achieving high accuracy and enabling localization of audio events.

Contribution

The paper presents a compact streaming spoken language understanding (SLU) architecture with a novel use of the connectionist temporal localization (CTL) loss for simultaneous classification and localization, reducing response latency in dialogue systems.

Findings

01. Achieves up to 98.97% accuracy in single-label classification.

02. Demonstrates effective localization of audio events with CTL.

03. Outperforms traditional methods in streaming SLU scenarios.

Abstract

Human-computer interaction (HCI) is significantly impacted by delayed responses from a spoken dialogue system. Hence, end-to-end (e2e) spoken language understanding (SLU) solutions have recently been proposed to decrease latency. Such approaches allow for the extraction of semantic information directly from the speech signal, thus bypassing the need for a transcript from an automatic speech recognition (ASR) system. In this paper, we propose a compact e2e SLU architecture for streaming scenarios, where chunks of the speech signal are processed continuously to predict intent and slot values. Our model is based on a 3D convolutional neural network (3D-CNN) and a unidirectional long short-term memory (LSTM). We compare the performance of two alignment-free losses: the connectionist temporal classification (CTC) method and its adapted version, namely connectionist temporal localization…
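As a rough illustration of the alignment-free idea mentioned in the abstract, the sketch below shows standard greedy CTC decoding over per-chunk scores: take the best class per frame, merge consecutive repeats, then drop the blank symbol. It is not the authors' model or their CTL variant. The label indices, the blank index of 0, the score values, and the helper names `stream_chunks` and `ctc_greedy_collapse` are all assumptions made for this example.

```python
from itertools import groupby
from typing import Iterator, List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def stream_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-size chunks of a signal; the last one may be shorter."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def ctc_greedy_collapse(frame_scores: Sequence[Sequence[float]], blank: int = BLANK) -> List[int]:
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    # Best-scoring class index for each frame (ties resolve to the lowest index).
    best_per_frame = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Collapse runs of the same label into a single occurrence.
    merged = [label for label, _ in groupby(best_per_frame)]
    # Blanks only separate labels; they are not part of the output sequence.
    return [label for label in merged if label != blank]


# Hypothetical scores for 7 frames over 3 classes: blank, label 1, label 2.
example_scores = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.80, 0.10, 0.10],
    [0.10, 0.10, 0.80],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],
]
decoded = ctc_greedy_collapse(example_scores)
chunks = [list(c) for c in stream_chunks(list(range(10)), 4)]
```

Here the per-frame argmax sequence is blank, 1, 1, blank, 2, 2, blank, which collapses to the label sequence 1, 2. In a streaming setting, scores for each new chunk would be appended and decoded incrementally; that part is left out of this sketch.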

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems