Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

Suyoun Kim; Akshat Shrivastava; Duc Le; Ju Lin; Ozlem Kalinli; Michael L. Seltzer

arXiv:2307.12134·cs.CL·July 25, 2023

TL;DR

This paper introduces a modality confidence aware training method for end-to-end spoken language understanding, improving robustness to ASR errors by fusing audio and text representations based on confidence estimates.

Contribution

The paper proposes two techniques: a method to encode the quality of ASR hypotheses, and an approach to integrate that encoding into E2E SLU models so that they are more robust to transcription errors.

Findings

1. Improved accuracy on the STOP dataset.

2. Effective encoding of ASR hypothesis quality.

3. Enhanced robustness to ASR errors.

Abstract

End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently. This approach uses a single model that utilizes audio and text representations from pre-trained automatic speech recognition (ASR) models, and outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still show weakness when text representation quality is low due to ASR transcription errors. To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. We show accuracy improvements on the STOP dataset and share the…
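
The idea of fusing audio and text representations according to an estimated modality confidence can be illustrated with a small sketch. This is not the paper's architecture: it assumes the confidence is a single scalar in [0, 1] and that fusion is a simple convex combination of two same-sized vectors. The function name `fuse_by_confidence` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List


def fuse_by_confidence(
    audio: List[float], text: List[float], confidence: float
) -> List[float]:
    """Blend two same-sized vectors, weighting text by `confidence`.

    Assumption (not from the paper): the fused vector is
    confidence * text + (1 - confidence) * audio, element by element.
    """
    if len(audio) != len(text):
        raise ValueError("audio and text vectors must have the same length")
    # Clamp so that an out-of-range estimate cannot flip the weighting.
    c = min(1.0, max(0.0, confidence))
    return [c * t + (1.0 - c) * a for a, t in zip(audio, text)]


# A low confidence leans on the audio vector, a high one on the text vector.
audio_vec = [1.0, 0.0]
text_vec = [0.0, 1.0]
low = fuse_by_confidence(audio_vec, text_vec, 0.25)
high = fuse_by_confidence(audio_vec, text_vec, 1.0)
```

Under these assumptions, a hypothesis judged unreliable contributes little to the fused vector, which matches the intuition in the abstract that the model should rely less on text when ASR errors are likely.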

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems