Leveraging Large Language Models for Exploiting ASR Uncertainty
Pranay Dighe, Yi Su, Shangshang Zheng, Yunshu Liu, Vineet Garg, Xiaochuan Niu, Ahmed Tewfik

TL;DR
This paper demonstrates that prompting large language models with n-best ASR hypotheses improves speech intent classification and keyword spotting, effectively exploiting ASR uncertainty without changing core models.
Contribution
It introduces a method of using n-best ASR hypotheses as prompts for LLMs, enhancing speech understanding performance without modifying the underlying models.
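The idea of prompting with an n-best list can be illustrated with a minimal sketch. The source does not give the paper's actual prompt wording, so the template text, the function name `build_nbest_prompt`, the intent labels, and the sample hypotheses below are all hypothetical and serve only to show how several competing ASR hypotheses might be placed in one prompt instead of the single top hypothesis.

```python
from typing import Optional, Sequence


def build_nbest_prompt(
    hypotheses: Sequence[str],
    intent_labels: Sequence[str],
    scores: Optional[Sequence[float]] = None,
) -> str:
    """Format an n-best ASR list as a text prompt for intent classification.

    The wording of the prompt is an illustrative assumption, not the
    template used in the paper.
    """
    if not hypotheses:
        raise ValueError("need at least one ASR hypothesis")
    if scores is not None and len(scores) != len(hypotheses):
        raise ValueError("scores and hypotheses must have the same length")

    lines = ["Below are ASR hypotheses for one spoken utterance, best first."]
    for rank, text in enumerate(hypotheses, start=1):
        # Optionally expose the ASR score so the LLM can weigh each hypothesis.
        suffix = f" (score: {scores[rank - 1]:.2f})" if scores is not None else ""
        lines.append(f"{rank}. {text}{suffix}")
    lines.append("Possible intents: " + ", ".join(intent_labels))
    lines.append("Answer with the single most likely intent.")
    return "\n".join(lines)


# Hypothetical n-best list in which the top hypothesis contains an ASR error.
nbest = ["play the beetles", "play the beatles", "lay the beatles"]
labels = ["play_music", "set_alarm"]

prompt = build_nbest_prompt(nbest, labels)           # n-best prompt
one_best_prompt = build_nbest_prompt(nbest[:1], labels)  # 1-best baseline
```

The resulting string would be sent to whichever LLM is in use. Passing only the first hypothesis reproduces the 1-best baseline, so the same function covers both conditions being compared.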
Findings
n-best list prompts outperform 1-best hypotheses in speech tasks
Prompt engineering and fine-tuning improve LLM performance on spoken language understanding
Approach is effective on device-directed speech detection and keyword spotting
Abstract
While large language models (LLMs) excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where the LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle the speech-intent classification task, where a high word-error-rate can limit the LLM's ability to understand the spoken intent. Instead of chasing high accuracy by designing complex or specialized architectures regardless of deployment costs, we seek to answer how far we can go without substantially changing the underlying ASR and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
