Joint decoding method for controllable contextual speech recognition based on Speech LLM
Yangui Fang, Jing Peng, Yu Xi, Xu Li, Haoyu Li, Chengwei Zhang, Guohui Zhong, Kai Yu

TL;DR
This paper introduces a joint decoding approach for controllable contextual speech recognition using Speech LLM, enabling explicit control over injected context and improving recognition accuracy, including sensitive word suppression.
Contribution
The proposed method allows explicit control of contextual information injection in Speech LLM, overcoming limitations of prompt-based methods and enhancing recognition performance.
Findings
Effective control over contextual information injection.
Improved recognition accuracy with the proposed method.
Speech LLM can acquire long contextual capabilities through joint decoding.
Abstract
Contextual speech recognition refers to the ability to identify preferences for specific content based on contextual information. Recently, leveraging the contextual understanding capabilities of Speech LLM to achieve contextual biasing by injecting contextual information through prompts have emerged as a research hotspot.However, the direct information injection method via prompts relies on the internal attention mechanism of the model, making it impossible to explicitly control the extent of information injection. To address this limitation, we propose a joint decoding method to control the contextual information. This approach enables explicit control over the injected contextual information and achieving superior recognition performance. Additionally, Our method can also be used for sensitive word suppression recognition.Furthermore, experimental results show that even Speech LLM…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
