TL;DR
This paper introduces the AT-AT model, a multi-task end-to-end (E2E) system that trains jointly on speech and text tasks. Joint training improves spoken language understanding (SLU) performance, enables zero-shot E2E SLU, and yields state-of-the-art results on multiple datasets.
Contribution
The paper presents a multi-task E2E model that leverages both speech and text data. It outperforms single-task models and enables zero-shot SLU without speech data.
Findings
Achieves state-of-the-art results on internal and public datasets.
Demonstrates effective zero-shot E2E SLU performance.
Outperforms single-task models trained on limited data.
Abstract
Voice assistants such as Alexa, Siri, and Google Assistant typically use a two-stage Spoken Language Understanding pipeline: first, an Automatic Speech Recognition (ASR) component processes customer speech and generates text transcriptions; then, a Natural Language Understanding (NLU) component maps the transcriptions to an actionable hypothesis. An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option. These systems were shown to be smaller, faster, and better optimized. However, they require massive amounts of end-to-end training data and, in addition, do not take advantage of the already available ASR and NLU training data. In this work, we propose an E2E system that is designed to jointly train on multiple speech-to-text tasks, such as ASR (speech-transcription) and SLU (speech-hypothesis), and text-to-text tasks, such as NLU…
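The joint training setup described in the abstract can be pictured as one pool of examples in which every task, whether its input is speech or text, maps to a text target sequence. The following is a minimal sketch of that idea only, not the authors' implementation: the `Example` class, the `make_pool`, `sample_batch`, and `text_only_pool` helpers, the placeholder audio frames, the `PlayMusic(genre=jazz)` hypothesis format, and the uniform task sampling are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Union

# Hypothetical input type: audio as a list of feature frames.
AudioFeatures = List[List[float]]


@dataclass
class Example:
    task: str                           # "asr", "slu", or "nlu"
    modality: str                       # "speech" or "text"
    source: Union[AudioFeatures, str]   # audio frames or a transcription
    target: str                         # transcription or hypothesis (always text)


def make_pool() -> Dict[str, List[Example]]:
    """Build a toy pool with one example per task (all values are made up)."""
    frames = [[0.0] * 4 for _ in range(3)]  # placeholder audio features
    return {
        "asr": [Example("asr", "speech", frames, "play some jazz")],
        "slu": [Example("slu", "speech", frames, "PlayMusic(genre=jazz)")],
        "nlu": [Example("nlu", "text", "play some jazz", "PlayMusic(genre=jazz)")],
    }


def sample_batch(pool: Dict[str, List[Example]], batch_size: int,
                 rng: random.Random) -> List[Example]:
    """Draw a mixed batch: pick a task uniformly, then an example from it."""
    tasks = sorted(pool)
    return [rng.choice(pool[rng.choice(tasks)]) for _ in range(batch_size)]


def text_only_pool(pool: Dict[str, List[Example]]) -> Dict[str, List[Example]]:
    """Keep only tasks whose inputs are text, as when no speech data exists."""
    return {
        task: examples
        for task, examples in pool.items()
        if all(ex.modality == "text" for ex in examples)
    }


if __name__ == "__main__":
    pool = make_pool()
    for ex in sample_batch(pool, batch_size=4, rng=random.Random(0)):
        print(ex.task, ex.modality, "->", ex.target)
```

Because every target is a text sequence, a single model could in principle consume batches that mix tasks. The `text_only_pool` helper hints at the zero-shot setting mentioned in the summary, where only text-to-hypothesis data is assumed to be available for training.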