Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning
Qian Chen, Wen Wang, Qinglin Zhang

TL;DR
This paper introduces a joint textual-phonetic pre-training method for spoken language understanding that leverages phonetic information to enhance model robustness against speech recognition errors, outperforming existing baselines.
Contribution
It proposes a novel pre-training approach combining textual and phonetic data, improving robustness and performance of end-to-end spoken language understanding models.
Findings
Significant performance improvements on Fluent Speech Commands and SNIPS benchmarks.
Enhanced robustness of SLU models to ASR errors.
Effective integration of phonetic information during pre-training and fine-tuning.
Abstract
In the traditional cascading architecture for spoken language understanding (SLU), automatic speech recognition (ASR) errors have been observed to degrade the performance of natural language understanding. End-to-end (E2E) SLU models have been proposed to map speech input directly to the desired semantic frame with a single model, thereby mitigating ASR error propagation. Recently, pre-training techniques have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming to exploit the full potential of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also…
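To make the idea of a conditional masked language model objective over joint textual and phonetic input more concrete, here is a minimal sketch of how one training example might be built. It is not the authors' implementation: the function name `make_conditional_mlm_example`, the `[CLS]`/`[SEP]`/`[MASK]` layout, the 15% masking rate, and the choice to mask only one side while leaving the other intact as the condition are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, borrowed from BERT-style models


def make_conditional_mlm_example(words, phonemes, mask_prob=0.15,
                                 mask_side="text", seed=None):
    """Build one (inputs, labels) pair for a conditional masked LM task.

    Tokens on `mask_side` are randomly replaced by MASK; the other side is
    left untouched so the model can condition on it. labels[i] holds the
    original token at masked positions and None everywhere else.
    """
    if mask_side not in ("text", "phoneme"):
        raise ValueError("mask_side must be 'text' or 'phoneme'")
    rng = random.Random(seed)

    text_in, text_lab = list(words), [None] * len(words)
    phon_in, phon_lab = list(phonemes), [None] * len(phonemes)
    if mask_side == "text":
        target_in, target_lab = text_in, text_lab
    else:
        target_in, target_lab = phon_in, phon_lab

    # Randomly mask tokens on the target side only.
    for i, tok in enumerate(target_in):
        if rng.random() < mask_prob:
            target_lab[i] = tok
            target_in[i] = MASK

    # Guarantee at least one prediction target per example.
    if target_in and all(lab is None for lab in target_lab):
        i = rng.randrange(len(target_in))
        target_lab[i] = target_in[i]
        target_in[i] = MASK

    # Assumed joint layout: [CLS] words [SEP] phonemes [SEP]
    inputs = ["[CLS]"] + text_in + ["[SEP]"] + phon_in + ["[SEP]"]
    labels = [None] + text_lab + [None] + phon_lab + [None]
    return inputs, labels


if __name__ == "__main__":
    # Hypothetical utterance with a hand-written phoneme sequence.
    words = ["turn", "on", "the", "lights"]
    phonemes = ["T", "ER", "N", "AA", "N", "DH", "AH", "L", "AY", "T", "S"]
    inputs, labels = make_conditional_mlm_example(words, phonemes, seed=0)
    print(inputs)
    print(labels)
```

In this sketch, a model trained on such pairs would be asked to recover the masked words given the full phoneme sequence (or the reverse when `mask_side="phoneme"`), which is one plausible way to tie the two representations together.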
