End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting
Thierry Desot, François Portet, Michel Vacher

TL;DR
This study analyzes how end-to-end speech understanding models perform in low-resource, non-English smart home commands, revealing their robustness to noise and their reliance on pitch features for concept detection.
Contribution
It provides the first detailed analysis of the linguistic and signal features used by E2E SLU models in a low-resource, non-English context, highlighting their advantages over pipeline systems.
Findings
E2E SLU performs well without perfect ASR.
E2E models handle noise and syntactic variation better than pipeline systems.
Pitch features are used by E2E models to identify concepts.
Abstract
Spoken Language Understanding (SLU) is a core task in most human-machine interaction systems. With the emergence of smart homes, smartphones and smart speakers, SLU has become a key technology for the industry. In a classical SLU approach, an Automatic Speech Recognition (ASR) module transcribes the speech signal into a textual representation from which a Natural Language Understanding (NLU) module extracts semantic information. Recently, End-to-End SLU (E2E SLU) based on Deep Neural Networks has gained momentum because it benefits from the joint optimization of the ASR and NLU parts, which limits the cascading errors of the pipeline architecture. However, little is known about the actual linguistic properties used by E2E models to predict concepts and intents from speech input. In this paper, we present a study identifying the signal features and other linguistic properties…
