SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
Yu-An Chung, Chenguang Zhu, Michael Zeng

TL;DR
SPLAT is a semi-supervised framework that jointly pre-trains speech and language modules and aligns their acoustic and semantic representations, improving performance on several spoken language understanding tasks, including a gain of over 10% on Spoken SQuAD.
Contribution
The paper introduces SPLAT, a novel semi-supervised joint pre-training method that aligns speech and text representations in a shared space using limited paired data.
Findings
Improves SLU performance on multiple tasks.
Achieves over 10% improvement on Spoken SQuAD.
Effectively aligns speech and text representations.
Abstract
Spoken language understanding (SLU) requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and…
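The abstract describes two ingredients: a masked-modeling loss on each module trained with unpaired data, and an alignment term computed on a small set of paired speech and text. The sketch below illustrates how such terms might be combined. It is not the authors' implementation: the mean pooling, the L1 distance, the `align_weight` parameter, and all function names are assumptions chosen for illustration only.

```python
from typing import List

Vector = List[float]


def mean_pool(sequence: List[Vector]) -> Vector:
    """Average a sequence of vectors into one fixed-size vector (assumed pooling)."""
    if not sequence:
        raise ValueError("cannot pool an empty sequence")
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]


def alignment_loss(speech_frames: List[Vector], text_tokens: List[Vector]) -> float:
    """L1 distance between pooled speech and text representations.

    The distance function is an assumption; the summary above only says the
    two modules are aligned in a shared latent space.
    """
    speech_vec = mean_pool(speech_frames)
    text_vec = mean_pool(text_tokens)
    if len(speech_vec) != len(text_vec):
        raise ValueError("speech and text vectors must share a dimension")
    return sum(abs(s - t) for s, t in zip(speech_vec, text_vec))


def joint_loss(speech_mlm_loss: float,
               text_mlm_loss: float,
               align_loss: float,
               align_weight: float = 1.0) -> float:
    """Combine the two masked-modeling losses with a weighted alignment term.

    align_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    return speech_mlm_loss + text_mlm_loss + align_weight * align_loss


if __name__ == "__main__":
    # Toy paired example: two speech frames and one text token, dimension 2.
    speech = [[1.0, 2.0], [3.0, 4.0]]
    text = [[2.0, 3.0]]
    print(alignment_loss(speech, text))
```

In this toy setup the alignment term is zero when the pooled speech and text vectors coincide, which is the sense in which the speech module is pushed toward representations that also carry the text module's information.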
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
