Introducing Semantics into Speech Encoders

Derek Xu; Shuyan Dong; Changhan Wang; Suyoun Kim; Zhaojiang Lin; Akshat Shrivastava; Shang-Wen Li; Liang-Hsuan Tseng; Alexei Baevski; Guan-Ting Lin; Hung-yi Lee; Yizhou Sun; Wei Wang

arXiv:2211.08402 · cs.CL · November 16, 2022

Open Access

TL;DR

This paper presents an unsupervised method to incorporate semantic information from large language models into self-supervised speech encoders, significantly enhancing their spoken language understanding capabilities without requiring labeled audio data.

Contribution

It introduces a task-agnostic, unsupervised approach to embed semantic information into speech encoders, reducing reliance on costly labeled transcriptions.

Findings

01 Over 10% improvement in intent classification accuracy

02 Modest gains in named entity resolution and slot filling

03 Over 2% increase in spoken question answering F1 score

Abstract

Recent studies find that existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which are expensive and time-consuming to obtain. We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding performance by over 10% on intent classification, with modest gains in named entity resolution and slot filling, and spoken question answering F1 score by over 2%. Our…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling