DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

Suwon Shon; Kwangyoun Kim; Yi-Te Hsu; Prashant Sridhar; Shinji Watanabe; Karen Livescu

arXiv:2406.09345·cs.CL·June 14, 2024

Open Access

TL;DR

This paper introduces DiscreteSLU, a spoken language understanding model that feeds an LLM self-supervised discrete speech units instead of continuous-valued speech encoder outputs. The model shows robust performance on speech from seen and unseen domains and follows instructions in spoken question answering.

Contribution

The paper proposes quantizing self-supervised speech encoder outputs into discrete speech units with k-means clustering, then mapping those units into the LLM token embedding space with a speech adapter.

Findings

01 Robust performance on seen and unseen domains

02 Effective instruction-following in spoken question answering

03 Discrete units outperform continuous features in some tasks

Abstract

The integration of pre-trained text-based large language models (LLMs) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to the LLM token embedding space using the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as Mel-frequency cepstral coefficients (MFCC). Our findings suggest that the ASR task…
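
The DSU step described in the abstract (frame-level features, then k-means clustering, then a mapping into an embedding space) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy 2-dimensional "frames" stand in for real encoder outputs, the function names (`fit_kmeans`, `to_discrete_units`, `embed_units`) are hypothetical, and the plain lookup table is only a placeholder for the speech adapter, whose architecture is not described in this excerpt. The number of clusters is likewise arbitrary here.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(frame, centroids):
    """Index of the centroid closest to one feature frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def fit_kmeans(frames, num_units, num_iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of centroids."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, num_units)]
    for _ in range(num_iters):
        clusters = [[] for _ in range(num_units)]
        for f in frames:
            clusters[nearest_centroid(f, centroids)].append(f)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[k] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids


def to_discrete_units(frames, centroids):
    """Replace each continuous frame with the ID of its nearest centroid."""
    return [nearest_centroid(f, centroids) for f in frames]


def embed_units(units, embedding_table):
    """Placeholder for the speech adapter: a simple per-unit lookup."""
    return [embedding_table[u] for u in units]


# Toy "encoder outputs": two well-separated groups of 2-D frames.
frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [0.05, 0.05]]
centroids = fit_kmeans(frames, num_units=2)
units = to_discrete_units(frames, centroids)

# Hypothetical embedding table with one 3-D vector per unit ID.
embedding_table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
embedded = embed_units(units, embedding_table)
```

In this sketch the unit IDs are arbitrary cluster labels, so only their grouping is meaningful: frames from the same region of feature space receive the same ID, and the sequence of IDs is what a downstream adapter would consume in place of the continuous features.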

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques