Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection
Sergio Burdisso, Esaú Villatoro-Tello, Shashi Kumar, Srikanth Madikeri, Andrés Carofilis, Pradeep Rangappa, Manjunath K E, Kadri Hacioglu, Petr Motlicek, Andreas Stolcke

TL;DR
This paper analyzes prompt sensitivity in LLM-based speech recognition and introduces a learnable prompt projector that improves performance and stability across datasets.
Contribution
The paper proposes a learnable prompt projector module that enhances prompt effectiveness without altering the underlying LLM-based ASR model.
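As a rough illustration of the idea, the sketch below applies a small learnable map to the prompt embeddings only, leaving the speech embeddings and the rest of the model untouched. This summary does not describe the module's internals, so everything here is an assumption made for illustration: the names `PromptProjector` and `build_llm_input`, the affine form of the projection, the identity initialization, and the prompt-then-speech ordering. Training of the weights is not shown.

```python
from typing import List

Vector = List[float]


def identity_matrix(dim: int) -> List[Vector]:
    """Return a dim x dim identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


class PromptProjector:
    """Hypothetical affine map applied to each prompt token embedding.

    Assumption: the paper's actual module may differ in form and size.
    """

    def __init__(self, dim: int) -> None:
        # Assumption: identity initialization, so an untrained projector
        # leaves the manually written prompt unchanged.
        self.weight = identity_matrix(dim)
        self.bias = [0.0] * dim

    def __call__(self, prompt_embeddings: List[Vector]) -> List[Vector]:
        # Compute W @ token + b for every token in the prompt.
        return [
            [sum(w * x for w, x in zip(row, token)) + b
             for row, b in zip(self.weight, self.bias)]
            for token in prompt_embeddings
        ]


def build_llm_input(
    projector: PromptProjector,
    prompt_embeddings: List[Vector],
    speech_embeddings: List[Vector],
) -> List[Vector]:
    """Project the prompt only, then concatenate with the speech embeddings.

    Assumption: prompt tokens precede speech tokens; the real ordering
    depends on the underlying ASR model's template.
    """
    return projector(prompt_embeddings) + speech_embeddings
```

In this sketch only `weight` and `bias` would be trainable, which is one way to read "without altering the underlying LLM-based ASR model": the speech encoder, speech-to-LLM projector, and LLM would all stay frozen.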
Findings
Prompt choice significantly affects ASR performance and stability.
The prompt projector consistently improves performance across datasets.
The prompt projector outperforms manually selected prompts in various scenarios.
Abstract
LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. This setup not only enables applicability across a range of practical scenarios, but also helps maximize model performance. However, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module, a simple, model-agnostic extension that learns to project…
