Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning

Yingyi Ma; Zhe Liu; Ozlem Kalinli

arXiv:2412.06967·cs.CL·December 11, 2024

TL;DR

This paper introduces a two-step soft prompt fine-tuning method for LLM-based ASR. It adapts the LLM decoder to domain-specific text-only data while preserving the benefit of that domain knowledge, which improves transcription accuracy on the target domain.

Contribution

A two-step soft prompt fine-tuning strategy for domain adaptation in LLM-based ASR. It addresses the loss of effectiveness that can occur when the decoder is fine-tuned on text-only data without paired prompts.

Findings

01

Up to 9% relative Word Error Rate (WER) reduction on the target domain

02

Up to 18% Entity Error Rate (EER) reduction

03

Further improvements when combined with domain-specific LM fusion

Abstract

The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR). Prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art in ASR. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly improve ASR performance on domain adaptation tasks. LLM-based ASR can naturally incorporate more text corpora by fine-tuning the LLM decoder, but fine-tuning such a system on text-only data without paired prompts may diminish the effectiveness of the domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with our proposed method achieved a relative Word Error Rate (WER) reduction of up to 9% and up to 18% Entity Error Rate (EER)…
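The reductions quoted above are relative to a baseline system. As a minimal sketch of how such figures are typically computed (not the authors' evaluation code), the Python below derives WER from a word-level edit distance and then a relative reduction. The function names `word_error_rate` and `relative_reduction` and the numbers in the usage lines are hypothetical, chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Fraction by which the adapted error rate improves on the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - adapted) / baseline


# Hypothetical numbers, not taken from the paper:
# a baseline WER of 10.0% and an adapted WER of 9.1%.
print(f"{relative_reduction(0.100, 0.091):.1%} relative WER reduction")
```

The same `relative_reduction` helper applies to EER once an entity-level error rate is available; how entities are identified and scored is not specified on this page, so it is left out of the sketch.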

Taxonomy

Topics: Text and Document Classification Technologies · Speech Recognition and Synthesis · Network Packet Processing and Optimization