End-to-end architectures for ASR-free spoken language understanding
Elisavet Palogiannidi, Ioannis Gkinis, George Mastrapas, Petr Mizera, Themos Stafylakis

TL;DR
This paper presents recurrent end-to-end neural architectures for spoken language understanding that achieve state-of-the-art intent classification on the FSC dataset without relying on ASR or pretrained models.
Contribution
The study introduces a set of recurrent architectures combined with data augmentation for end-to-end SLU, eliminating the need for ASR-level targets or pretrained ASR models.
Findings
Achieves state-of-the-art intent classification results on the FSC dataset.
Models generalize reasonably well to unseen wordings.
Data augmentation enhances model performance.
Abstract
Spoken Language Understanding (SLU) is the problem of extracting meaning from speech utterances. It is typically addressed as a two-step problem, where an Automatic Speech Recognition (ASR) model is employed to convert speech into text, followed by a Natural Language Understanding (NLU) model that extracts meaning from the decoded text. Recently, end-to-end approaches have emerged that aim to unify ASR and NLU into a single SLU deep neural architecture, trained using combinations of ASR-level and NLU-level recognition units. In this paper, we explore a set of recurrent architectures for intent classification, tailored to the recently introduced Fluent Speech Commands (FSC) dataset, where intents are formed as combinations of three slots (action, object, and location). We show that by combining deep recurrent architectures with standard data augmentation, state-of-the-art results can be…
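To make the notion of an intent as a combination of three slots concrete, here is a minimal Python sketch. It is not the authors' code. The slot values are illustrative placeholders rather than entries quoted from the dataset, and the scoring rule (an intent counts as correct only when all three slots match) is an assumption made for this example.

```python
from typing import NamedTuple, Sequence


class Intent(NamedTuple):
    """An intent as a combination of three slots (field names are hypothetical)."""
    action: str
    obj: str
    location: str


def intent_accuracy(predicted: Sequence[Intent], reference: Sequence[Intent]) -> float:
    """Fraction of utterances whose three slots all match the reference.

    Assumed scoring rule for illustration: one wrong slot makes the
    whole intent count as wrong.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    # Tuple equality compares action, obj, and location together.
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Illustrative slot values only; not taken from the dataset.
reference = [
    Intent("activate", "lights", "kitchen"),
    Intent("increase", "volume", "none"),
]
predicted = [
    Intent("activate", "lights", "kitchen"),
    Intent("increase", "heat", "none"),  # one wrong slot, so the intent is wrong
]
accuracy = intent_accuracy(predicted, reference)
```

Treating the intent as a single tuple keeps the comparison strict: under this assumed rule a model gets no partial credit for recovering two of the three slots.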
