Improving End-to-End Models for Set Prediction in Spoken Language Understanding
Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon

TL;DR
This paper improves end-to-end spoken language understanding models for set prediction by proposing a data augmentation technique and an implicit attention-based alignment method, significantly raising F1 scores when the spoken order of entities is unknown.
Contribution
It introduces a novel data augmentation technique and an implicit attention-based alignment method that infers spoken order, improving E2E SLU models trained on unordered entity sets.
Findings
F1 scores increased by over 11% for RNN transducers.
F1 scores increased by about 2% for attention-based encoder-decoders.
Proposed methods outperform previous results in set prediction accuracy.
Abstract
The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts. Advances in end-to-end (E2E) speech modeling have made it possible to train solely on semantic entities, which are far cheaper to collect than verbatim transcripts. We focus on this set prediction problem, where entity order is unspecified. Using two classes of E2E models, RNN transducers and attention based encoder-decoders, we show that these models work best when the training entity sequence is arranged in spoken order. To improve E2E SLU models when entity spoken order is unknown, we propose a novel data augmentation technique along with an implicit attention based alignment method to infer the spoken order. F1 scores significantly increased by more than 11% for RNN-T and about 2% for attention based…
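Because the task is set prediction, scoring has to ignore the order in which entities are emitted. The sketch below shows one common way to compute an order-independent F1 over predicted entities. It is an illustration only, not the paper's scoring script: the function name `set_f1`, the representation of entities as (label, value) pairs, and the example labels are all assumptions made here.

```python
from collections import Counter

def set_f1(reference, hypothesis):
    """Order-independent F1 between two collections of (label, value) entities.

    Hypothetical helper for illustration; entities are compared as exact
    (label, value) pairs and duplicates are counted via multiset intersection.
    """
    ref, hyp = Counter(reference), Counter(hypothesis)
    true_pos = sum((ref & hyp).values())  # entities present in both
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(hyp.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Illustrative entities (made-up labels and values).
reference = [
    ("fromloc.city", "boston"),
    ("toloc.city", "denver"),
    ("depart_date.day_name", "monday"),
]
# Same entities in a different order, with one wrong value.
hypothesis = [
    ("toloc.city", "denver"),
    ("fromloc.city", "boston"),
    ("depart_date.day_name", "tuesday"),
]

score = set_f1(reference, hypothesis)
print(f"F1 = {score:.3f}")
```

Reordering the hypothesis does not change the score, which is the property that lets a model be evaluated without knowing the spoken order of the entities.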
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
