TL;DR
This paper improves speech recognition accuracy for rare bias words by leveraging acoustic cues and bias word position prediction in speech-aware LLMs, without requiring phonetic expertise or G2P tools.
Contribution
It introduces a phoneme-free contextual biasing method using acoustic cues and a multi-output bias word position predictor, enhancing robustness and accuracy.
Findings
Reduces bias word recognition errors by 16.3%
Improves out-of-domain recognition accuracy
Eliminates need for phonetic knowledge or G2P tools
Abstract
Speech-aware LLMs (SLLMs) have recently achieved state-of-the-art ASR performance; however, they still fail to accurately transcribe bias words that appear rarely or never in the training data. Contextual biasing mechanisms are commonly implemented by introducing a predefined bias word list into the model via a text prompt or additional module. For further improvement, predefined bias words can be paired with their phoneme representations as pronunciation cues. Typically, phoneme sequences are generated through a G2P system that covers the target languages and domains of the bias words. Therefore, when a compatible G2P system is unavailable, phoneme-assisted contextual biasing becomes difficult to perform. Moreover, manually adding accurate phoneme sequences requires advanced phonetic knowledge. In this paper, we explore contextual biasing in SLLM based on acoustic cues associated with…
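The prompt-based biasing that the abstract describes as the common baseline can be pictured with a minimal sketch. It assumes a plain-text instruction template; the function name `build_biasing_prompt`, the template wording, and the optional phoneme format are all hypothetical illustrations, not the paper's prompt or the format of any particular SLLM.

```python
from typing import Mapping, Optional, Sequence


def build_biasing_prompt(
    bias_words: Sequence[str],
    phonemes: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a hypothetical transcription prompt that lists bias words.

    `phonemes` optionally maps a bias word to a pronunciation cue, e.g. a
    phoneme string produced by a G2P tool. Words without a cue are listed
    as plain text.
    """
    entries = []
    for word in bias_words:
        cue = phonemes.get(word) if phonemes else None
        # Attach the pronunciation cue in parentheses when one is available.
        entries.append(f"{word} ({cue})" if cue else word)

    # With no bias words, fall back to a plain transcription instruction.
    if not entries:
        return "Transcribe the speech."
    return "Transcribe the speech. Words that may appear: " + ", ".join(entries) + "."
```

Calling `build_biasing_prompt(["Kubernetes", "Granite"])` produces a prompt with a plain word list, while passing a `phonemes` mapping adds the pronunciation cues that, per the abstract, normally depend on a compatible G2P system or manual phonetic annotation.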
Code & Models
- 🤗 ibm-granite/granite-speech-4.1-2b (model): 374k downloads, ♡ 103
- 🤗 ibm-granite/granite-speech-4.1-2b-plus (model): 17k downloads, ♡ 56
- 🤗 ibm-granite/granite-speech-4.1-2b-nar (model): 6.8k downloads, ♡ 44
- 🤗 konszvi/granite-speech-4.1-2b-plus2 (model): 235 downloads
- 🤗 valoomba/granite-speech-4.1-2b-plus-ONNX (model): 72 downloads, ♡ 1
