Introducing Semantics into Speech Encoders
Derek Xu, Shuyan Dong, Changhan Wang, Suyoun Kim, Zhaojiang Lin, Akshat Shrivastava, Shang-Wen Li, Liang-Hsuan Tseng, Alexei Baevski, Guan-Ting Lin, Hung-yi Lee, Yizhou Sun, Wei Wang

TL;DR
This paper presents an unsupervised method to incorporate semantic information from large language models into self-supervised speech encoders, significantly enhancing their spoken language understanding capabilities without requiring labeled audio data.
Contribution
It introduces a task-agnostic, unsupervised approach to embed semantic information into speech encoders, reducing reliance on costly labeled transcriptions.
Findings
Over 10% improvement in intent classification accuracy
Modest gains in named entity resolution and slot filling
Over 2% increase in spoken question answering frame-level F1 (FF1) score
Abstract
Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which are expensive and time-consuming to obtain. We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding performance by over 10% on intent classification, with modest gains in named entity resolution and slot filling, and spoken question answering FF1 score by over 2%. Our…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
