Bridging Speech and Textual Pre-trained Models with Unsupervised ASR
Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia,, Shinji Watanabe, Ann Lee, Hung-yi Lee

TL;DR
This paper introduces an unsupervised method that uses automatic speech recognition to connect speech and text models, significantly improving performance across multiple spoken language understanding tasks, including state-of-the-art results in spoken question answering.
Contribution
It presents a simple, unsupervised paradigm using ASR as a bridge between speech and textual pre-trained models, enhancing SLU task performance.
Findings
Unsupervised ASR improves speech model representations.
The method enhances performance on five SLU tasks.
Achieves state-of-the-art on NMSQA benchmark.
Abstract
Spoken language understanding (SLU) is a task aiming to extract high-level semantics from spoken utterances. Previous works have investigated the use of speech self-supervised models and textual pre-trained models, which have shown reasonable improvements to various SLU tasks. However, because of the mismatched modalities between speech signals and text tokens, previous methods usually need complex designs of the frameworks. This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models, resulting in an unsupervised speech-to-semantic pre-trained model for various tasks in SLU. To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models. Our experiments show that unsupervised ASR itself can improve the representations…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
