WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Mohan Li; Cong-Thanh Do; Simon Keizer; Youmna Farag; Svetlana Stoyanchev; Rama Doddipatla

arXiv:2408.16423·eess.AS·August 30, 2024

Open Access

TL;DR

WHISMA is a speech-LLM for zero-shot spoken language understanding (SLU) that combines the Whisper speech encoder with Llama-3 and improves zero-shot slot filling and domain generalization on SLU benchmarks.

Contribution

We introduce WHISMA, a speech-LLM fine-tuned in a parameter-efficient manner that improves zero-shot SLU performance and domain generalization, and we evaluate it on a new task-agnostic benchmark, SLU-GLUE.

Findings

01. 26.6% relative improvement in zero-shot slot filling on SLURP, compared to the current state-of-the-art model

02. 33.0% relative gain over Qwen-Audio on SLU-GLUE

03. Robust performance across diverse SLU tasks
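
The first two findings are stated as relative gains, not absolute score differences. As a minimal sketch of how such a figure is usually computed, assuming the standard definition (new minus baseline, divided by baseline): the function name `relative_gain` and the example scores below are hypothetical placeholders and are not taken from the paper, which does not give the underlying scores in this excerpt.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Return the relative improvement of `new` over `baseline`, in percent.

    Assumes a higher-is-better metric and the standard definition
    (new - baseline) / baseline; this is not confirmed as the paper's
    exact formula.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Placeholder scores for illustration only, NOT results from the paper.
baseline_score = 50.0
new_score = 60.0
print(f"{relative_gain(baseline_score, new_score):.1f}% relative gain")
```

With these placeholder values, an absolute difference of 10 points corresponds to a 20% relative gain, which is why a relative figure should be read together with the baseline it refers to.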

Abstract

Speech large language models (speech-LLMs) integrate speech and text-based foundation models to provide a unified framework for handling a wide range of downstream tasks. In this paper, we introduce WHISMA, a speech-LLM tailored for spoken language understanding (SLU) that demonstrates robust performance in various zero-shot settings. WHISMA combines the speech encoder from Whisper with the Llama-3 LLM, and is fine-tuned in a parameter-efficient manner on a comprehensive collection of SLU-related datasets. Our experiments show that WHISMA significantly improves the zero-shot slot filling performance on the SLURP benchmark, achieving a relative gain of 26.6% compared to the current state-of-the-art model. Furthermore, to evaluate WHISMA's generalisation capabilities to unseen domains, we develop a new task-agnostic benchmark named SLU-GLUE. The evaluation results indicate that WHISMA…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling