WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

Heting Gao; Junrui Ni; Kaizhi Qian; Yang Zhang; Shiyu Chang; Mark Hasegawa-Johnson

arXiv:2203.15863·eess.AS·April 15, 2022·1 cites

Open Access · 1 Repo

TL;DR

WavPrompt leverages frozen language models and fine-tuned wav2vec to enable few-shot speech understanding, surpassing naive text baselines and extracting richer information beyond transcriptions.

Contribution

The paper introduces WavPrompt, a novel framework that adapts frozen language models for few-shot speech understanding using a fine-tuned wav2vec encoder.

Findings

01. WavPrompt outperforms naive text baselines in speech understanding tasks.

02. Detailed ablation studies identify optimal model configurations.

03. WavPrompt can extract additional non-speech information beyond transcriptions.

Abstract

Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings functioning like the text embeddings of the language model. Interested in exploring the possibility of transferring the few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model. We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline. We conduct detailed ablation studies on…
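
The core idea in the abstract, a trainable audio encoder feeding embeddings into a frozen language model, can be illustrated with a minimal PyTorch sketch. Everything here is an assumption made for illustration: `ToyAudioEncoder` and `ToyLM` are hypothetical stand-ins (a single convolution instead of wav2vec, a small GRU instead of a pretrained language model), the sizes are arbitrary, and the next-token loss on the text positions is one plausible training signal rather than the objective confirmed by the paper. It is not the code from hertin/wavprompt.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, D_MODEL = 100, 32  # toy sizes chosen for illustration only


class ToyAudioEncoder(nn.Module):
    """Hypothetical stand-in for a wav2vec-style encoder (waveform -> embeddings)."""

    def __init__(self, d_model: int, stride: int = 160):
        super().__init__()
        self.conv = nn.Conv1d(1, d_model, kernel_size=400, stride=stride)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> (batch, frames, d_model)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)


class ToyLM(nn.Module):
    """Hypothetical stand-in for a pretrained autoregressive language model."""

    def __init__(self, vocab: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # left-to-right
        self.head = nn.Linear(d_model, vocab)

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.rnn(inputs_embeds)
        return self.head(hidden)  # (batch, seq, vocab)


encoder = ToyAudioEncoder(D_MODEL)
lm = ToyLM(VOCAB, D_MODEL)

# Freeze the language model: only the audio encoder receives gradients.
for p in lm.parameters():
    p.requires_grad_(False)

wav = torch.randn(2, 16000)                 # two fake one-second clips
text_ids = torch.randint(0, VOCAB, (2, 5))  # fake token ids for the text part

audio_emb = encoder(wav)                    # (2, frames, D_MODEL)
text_emb = lm.embed(text_ids)               # (2, 5, D_MODEL)

# Audio embeddings are placed in front of the text embeddings, so the
# language model consumes them as if they were ordinary token embeddings.
inputs = torch.cat([audio_emb, text_emb], dim=1)
logits = lm(inputs)

# Assumed objective: next-token prediction on the text positions only.
# The output at position t is scored against the token at position t + 1.
n_audio = audio_emb.size(1)
pred = logits[:, n_audio - 1 : -1, :]
loss = nn.functional.cross_entropy(pred.reshape(-1, VOCAB), text_ids.reshape(-1))
loss.backward()
```

After `loss.backward()`, only the encoder parameters carry gradients, which is what lets the language model keep its pretrained behavior while the encoder learns to produce embeddings the model can consume.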

Code & Models

Repositories

hertin/wavprompt (PyTorch · Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques