Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

Mohan Li; Simon Keizer; Rama Doddipatla

arXiv:2406.15209 · eess.AS · June 24, 2024 · Interspeech

Open Access

TL;DR

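This paper introduces a zero-shot end-to-end spoken language understanding (SLU) system built on Whisper. SLU tasks are cast as question answering and the model is adapted with prefix-tuning. The system achieves a 40.7% absolute SLU-F1 gain for slot filling on SLURP over a recently introduced zero-shot benchmark, performs comparably to a Whisper-GPT-2 modular system, and is reported to cut model parameters by 34.8%.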

Contribution

The paper combines Whisper with a QA framework and prefix-tuning for zero-shot SLU. Only a minimal set of parameters is optimised rather than the entire Whisper model, and the system relies on a standalone speech model instead of a large language model.
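As a rough illustration of why prefix-tuning trains only a small set of parameters, the sketch below counts trainable values under the common formulation in which one key vector and one value vector are learned per prefix position at each attention layer, while the base model stays frozen. Whether the paper uses exactly this variant is not stated in this summary; the function names and all sizes are hypothetical toy values, not Whisper's real configuration.

```python
def prefix_param_count(n_layers: int, prefix_len: int, d_model: int) -> int:
    """Trainable values when each layer gets prefix_len key and value vectors."""
    # Factor 2: one key vector and one value vector per prefix position.
    return n_layers * prefix_len * 2 * d_model


def trainable_fraction(prefix_params: int, frozen_params: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return prefix_params / (prefix_params + frozen_params)


# Hypothetical toy sizes chosen for easy arithmetic (assumption, not from the paper).
toy_prefix = prefix_param_count(n_layers=4, prefix_len=10, d_model=8)
toy_fraction = trainable_fraction(toy_prefix, frozen_params=63_360)

print(toy_prefix)    # 640
print(toy_fraction)  # 0.01
```

In this toy setting only 1% of the parameters are trained; the frozen model is shared and only the prefix differs per task.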

Findings

01. 40.7% absolute gain in slot filling (SLU-F1) on SLURP over a recently introduced zero-shot benchmark

02. Performs comparably to a Whisper-GPT-2 modular system

03. 34.8% reduction in model parameters

Abstract

Zero-shot spoken language understanding (SLU) enables systems to comprehend user utterances in new domains without prior exposure to training data. Recent studies often rely on large language models (LLMs), leading to excessive footprints and complexity. This paper proposes the use of Whisper, a standalone speech processing model, for zero-shot end-to-end (E2E) SLU. To handle unseen semantic labels, SLU tasks are integrated into a question-answering (QA) framework, which prompts the Whisper decoder for semantics deduction. The system is efficiently trained with prefix-tuning, optimising a minimal set of parameters rather than the entire Whisper model. We show that the proposed system achieves a 40.7% absolute gain for slot filling (SLU-F1) on SLURP compared to a recently introduced zero-shot benchmark. Furthermore, it performs comparably to a Whisper-GPT-2 modular system under both…
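The QA framing described in the abstract can be sketched as follows: each semantic label is turned into a natural-language question, so a label never seen in training only requires a new question rather than a new output class. This is a minimal sketch under stated assumptions. The question templates, function names, and example labels are hypothetical and are not taken from the paper, and the step that feeds each question to the Whisper decoder as a prompt is omitted.

```python
from typing import Dict, List


def label_to_words(label: str) -> str:
    """Turn a label identifier such as 'event_name' into 'event name'."""
    return label.replace("_", " ").strip().lower()


def build_slot_question(slot_label: str) -> str:
    """Hypothetical template: one question per slot type."""
    return f"what is the {label_to_words(slot_label)}?"


def build_intent_question(candidate_intents: List[str]) -> str:
    """Hypothetical template: one question listing the candidate intents."""
    options = ", ".join(label_to_words(c) for c in candidate_intents)
    return f"which of these is the intent: {options}?"


def build_prompts(slot_labels: List[str], candidate_intents: List[str]) -> Dict[str, str]:
    """Map each SLU sub-task to the question that would prompt the decoder."""
    prompts = {"intent": build_intent_question(candidate_intents)}
    for slot in slot_labels:
        prompts[slot] = build_slot_question(slot)
    return prompts


# Example labels are illustrative only (assumption).
prompts = build_prompts(["event_name", "date"], ["calendar_set", "alarm_set"])
for task, question in prompts.items():
    print(f"{task}: {question}")
```

Because the questions are built from label names at inference time, moving to a new domain in this sketch means supplying a new list of labels, with no change to the model's output vocabulary.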

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training