Zero-shot Domain-sensitive Speech Recognition with Prompt-conditioning Fine-tuning

Feng-Ting Liao; Yung-Chieh Chan; Yi-Chang Chen; Chan-Jan Hsu; Da-shan Shiu

arXiv:2307.10274·eess.AS·October 9, 2023

TL;DR

This paper introduces a prompt-conditioned fine-tuning approach for speech recognition that enhances domain sensitivity, achieving significant WER reductions across multiple unseen domains using both audio and text-only data.

Contribution

It presents a novel prompt-conditioning fine-tuning method for domain-sensitive speech recognition, including a text-only adaptation technique for improved domain generalization.

Findings

01. Up to 33% WER reduction on unseen datasets.

02. Effective domain adaptation with text-only fine-tuning.

03. Model generalizes across diverse domains and prompt contexts.

Abstract

In this work, we propose a method to create domain-sensitive speech recognition models that utilize textual domain information by conditioning their generation on a given text prompt. This is accomplished by fine-tuning a pre-trained, end-to-end model (Whisper) to learn from demonstrations with prompt examples. We show that this ability generalizes to different domains and even to various prompt contexts, with our model achieving a Word Error Rate (WER) reduction of up to 33% on unseen datasets from various domains, such as medical conversation, air traffic control communication, and financial meetings. Considering the limited availability of audio-transcript pair data, we further extend our method to text-only fine-tuning to achieve domain sensitivity as well as domain adaptation. We demonstrate that our text-only fine-tuned model can also attend to various prompt contexts, with the…
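Since the headline result is stated as a WER reduction, a minimal sketch of how such a figure could be computed may help. It assumes the standard word-level edit-distance definition of WER and assumes the reported reduction is relative to a baseline; the listing does not say which convention the authors use. The function names and the example sentences are illustrative and do not come from the paper or its repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative reduction (assumed convention): (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the paper's datasets.
    reference = "the patient has hypertension"
    hypothesis = "the patient has hyper tension"
    print(word_error_rate(reference, hypothesis))
    print(relative_wer_reduction(0.30, 0.20))
```

Under the relative convention, a drop from a WER of 0.30 to 0.20 counts as a reduction of about one third, even though the absolute difference is only 0.10.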

Code & Models

Repositories

mtkresearch/clairaudience (PyTorch, official)

Models

MediaTek-Research/Clairaudience (Hugging Face model · 14 downloads · 2 likes)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems