Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning

Qian Chen; Wen Wang; Qinglin Zhang

arXiv:2104.10357·cs.CL·September 2, 2021

TL;DR

This paper introduces a joint textual-phonetic pre-training method for spoken language understanding that leverages phonetic information to enhance model robustness against speech recognition errors, outperforming existing baselines.

Contribution

The paper proposes a pre-training approach that learns joint textual and phonetic representations, improving both the performance of end-to-end spoken language understanding models and their robustness to speech recognition errors.

Findings

01 Significant performance improvements on the Fluent Speech Commands and SNIPS benchmarks.

02 Enhanced robustness of SLU models to ASR errors.

03 Effective integration of phonetic information during pre-training and fine-tuning.

Abstract

In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition (ASR) errors could be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU models have been proposed to directly map speech input to the desired semantic frame with a single model, hence mitigating ASR error propagation. Recently, pre-training technologies have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming at exploring the full potential of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also…
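
To make the idea of a conditional masked language model objective over paired text and phoneme inputs more concrete, here is a minimal sketch. It is not the authors' implementation: the function name `conditional_mask`, the `[CLS] words [SEP] phonemes [SEP]` layout, the 15% masking rate, and the ARPAbet-style phoneme labels are all assumptions made for illustration. The sketch masks tokens in one modality only, so a model would have to predict them while the other modality stays fully visible.

```python
import random

# Special tokens are illustrative assumptions, not taken from the paper.
MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"


def conditional_mask(words, phonemes, mask_side, mask_prob=0.15, rng=None):
    """Mask tokens in one modality only; the other stays fully visible.

    Returns (inputs, labels). labels[i] holds the original token at each
    masked position and None everywhere else.
    """
    if mask_side not in ("text", "phoneme"):
        raise ValueError("mask_side must be 'text' or 'phoneme'")
    if rng is None:
        rng = random.Random()

    target = words if mask_side == "text" else phonemes
    # Sample positions to mask; force at least one so each example has a label.
    picked = {i for i in range(len(target)) if rng.random() < mask_prob}
    if not picked and target:
        picked = {rng.randrange(len(target))}

    def render(tokens, maskable):
        inputs, labels = [], []
        for i, tok in enumerate(tokens):
            hit = maskable and i in picked
            inputs.append(MASK if hit else tok)
            labels.append(tok if hit else None)
        return inputs, labels

    w_in, w_lab = render(words, mask_side == "text")
    p_in, p_lab = render(phonemes, mask_side == "phoneme")
    inputs = [CLS] + w_in + [SEP] + p_in + [SEP]
    labels = [None] + w_lab + [None] + p_lab + [None]
    return inputs, labels


# Hypothetical utterance with hand-written ARPAbet-style phonemes.
words = ["turn", "on", "the", "lights"]
phonemes = ["T", "ER", "N", "AA", "N", "DH", "AH", "L", "AY", "T", "S"]
inputs, labels = conditional_mask(words, phonemes, "phoneme", rng=random.Random(0))
```

With `mask_side="phoneme"`, the word segment is passed through unchanged and only phoneme positions carry labels; switching to `mask_side="text"` gives the opposite direction. A training loop would compute cross-entropy only at positions where `labels` is not `None`.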
