TL;DR
This paper explores the development and adaptation of RNN transducer models for spoken language understanding across various data availability scenarios, demonstrating effective use of synthetic speech and achieving state-of-the-art results.
Contribution
It introduces methods for building and adapting RNN-T SLU models from pre-trained ASR systems in diverse practical settings, including when only SLU labels are available or when transcripts exist without real audio and synthetic speech is used instead.
Findings
RNN-T SLU models perform comparably to other end-to-end models.
When real audio is unavailable, synthesized speech can be used to adapt the SLU models successfully.
State-of-the-art results are achieved on the ATIS corpus and a customer call center data set.
Abstract
We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a more restrictive case where transcripts are available but not corresponding audio. We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition (ASR) systems, followed by an SLU adaptation step. In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models. When evaluated on two SLU data sets, the ATIS corpus and a customer call center data set, the proposed models closely track the performance of other E2E models and achieve…
