SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding

Yu-An Chung; Chenguang Zhu; Michael Zeng

arXiv:2010.02295 · cs.CL · March 16, 2021 · 1 citation

TL;DR

SPLAT is a semi-supervised joint pre-training framework for speech and language modules that enhances spoken language understanding by aligning acoustic and semantic representations, leading to significant performance improvements.

Contribution

The paper introduces SPLAT, a novel semi-supervised joint pre-training method that aligns speech and text representations in a shared space using limited paired data.

Findings

01 Improves SLU performance on multiple tasks.

02 Improves on the previous state-of-the-art result on Spoken SQuAD by more than 10%.

03 Effectively aligns speech and text representations.

Abstract

Spoken language understanding (SLU) requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and…
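The abstract describes two ingredients: a masked-modeling loss on each module, and an alignment term that pulls paired speech and text representations together in a shared space. The sketch below illustrates how such an objective could be combined. It is not the authors' implementation. The mean pooling, the L1 distance, the `align_weight` parameter, and all function names are assumptions made for illustration, and the masked-modeling losses are passed in as plain numbers rather than computed.

```python
from typing import List

Vector = List[float]


def mean_pool(sequence: List[Vector]) -> Vector:
    """Average a sequence of vectors into one sequence-level vector.

    Assumption: mean pooling stands in for whatever sequence summary
    the real model uses.
    """
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]


def alignment_loss(speech_frames: List[Vector], text_tokens: List[Vector]) -> float:
    """L1 distance between pooled speech and pooled text representations.

    Assumption: L1 distance is one possible choice of alignment measure.
    The two sequences may have different lengths but must share a dimension.
    """
    speech_vec = mean_pool(speech_frames)
    text_vec = mean_pool(text_tokens)
    return sum(abs(s - t) for s, t in zip(speech_vec, text_vec))


def joint_loss(speech_mlm: float, text_mlm: float, align: float,
               align_weight: float = 1.0) -> float:
    """Sum the two masked-modeling losses and the weighted alignment term.

    Assumption: a simple weighted sum; `align_weight` is hypothetical.
    """
    return speech_mlm + text_mlm + align_weight * align


# Toy paired example: 2 speech frames and 3 text tokens, both 2-dimensional.
speech = [[1.0, 0.0], [3.0, 2.0]]
text = [[2.0, 1.0], [2.0, 3.0], [2.0, 2.0]]

align = alignment_loss(speech, text)
total = joint_loss(speech_mlm=0.5, text_mlm=0.25, align=align, align_weight=2.0)
print(align, total)
```

In this toy setup only the paired example contributes to the alignment term, while the masked-modeling losses could come from unpaired speech and text, which mirrors the semi-supervised split described in the abstract.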

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling