End-to-End Neural Transformer Based Spoken Language Understanding
Martin Radfar, Athanasios Mouchtaris, and Siegfried Kunzmann

TL;DR
This paper introduces an end-to-end neural transformer model for spoken language understanding that predicts domain, intent, and slots directly from audio signals. It outperforms earlier recurrent and convolutional models on Fluent Speech Commands and is small and parallelizable enough to be a candidate for on-device processing.
Contribution
The paper presents the first end-to-end transformer-based SLU model that predicts domain, intent, and slots directly from audio without intermediate token steps, improving accuracy and efficiency.
Findings
Achieves 98.1% domain accuracy and 99.6% accuracy for both intents and slots on the Fluent Speech Commands dataset.
Outperforms SLU models built on recurrent and convolutional neural networks by 1.4%.
The model is 25% smaller than those architectures and highly parallelizable, making it a candidate for on-device SLU.
Abstract
Spoken language understanding (SLU) refers to the process of inferring semantic information from audio signals. While neural transformers consistently deliver the best performance among state-of-the-art neural architectures in the field of natural language processing (NLP), their merits in the closely related field of SLU have not been investigated. In this paper, we introduce an end-to-end neural transformer-based SLU model that can predict the variable-length domain, intent, and slot vectors embedded in an audio signal without an intermediate token prediction architecture. This new architecture leverages the self-attention mechanism, by which the audio signal is transformed into various sub-spaces, making it possible to extract the semantic context implied by an utterance. Our end-to-end transformer SLU predicts the domains, intents and slots in the Fluent…
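The self-attention step described in the abstract can be illustrated with a minimal sketch of generic multi-head scaled dot-product attention over a sequence of audio feature frames. This is not the authors' implementation: the function names, the random projection matrices, and the toy dimensions (5 frames, 8 features, 2 heads of size 4) are all assumptions chosen for illustration. It only shows how each head projects the same frames into its own sub-space independently of the others, which is the property that makes the computation easy to parallelize.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(frames, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head (one sub-space)."""
    q = matmul(frames, w_q)  # queries, one row per frame
    k = matmul(frames, w_k)  # keys
    v = matmul(frames, w_v)  # values
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # frame-to-frame similarity
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(frames, heads):
    """Run each head independently and concatenate the per-frame outputs."""
    outputs = [attention_head(frames, *head)[0] for head in heads]
    return [sum((out[t] for out in outputs), []) for t in range(len(frames))]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy input: 5 frames of 8 features each (illustrative values, not real audio).
rng = random.Random(0)
num_frames, feature_dim, head_dim, num_heads = 5, 8, 4, 2
frames = random_matrix(num_frames, feature_dim, rng)
heads = [
    tuple(random_matrix(feature_dim, head_dim, rng) for _ in range(3))
    for _ in range(num_heads)
]
context = multi_head_self_attention(frames, heads)
```

Here `context` holds one vector per frame whose length is `num_heads * head_dim`. In a trained model the projection matrices would be learned, and further layers would map these context vectors to domain, intent, and slot predictions.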
