CLAR: CIF-Localized Alignment for Retrieval-Augmented Speech LLM-Based Contextual ASR
Shangkun Huang, Huan Shen, Wei Zou, Yunzhang Chen

TL;DR
CLAR is a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) alignment to localize and retrieve hotwords for speech LLM-based ASR, improving recognition accuracy, especially for named entities.
Contribution
The paper presents CLAR, a new retrieval method using CIF for monotonic alignment, enabling more accurate hotword localization and improved contextual ASR performance.
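To make the monotonic alignment idea concrete, here is a minimal sketch of the general CIF mechanism in plain Python. It is not the authors' implementation: the function name `cif`, the firing threshold of 1.0, and the toy inputs are assumptions for illustration, and the summary does not say how CLAR predicts the per-frame weights. Each frame contributes its weight to an accumulator; whenever the accumulator reaches the threshold, the weighted sum of frames collected so far is emitted as one token-level vector, so tokens come out strictly left to right.

```python
def cif(frames, alphas, threshold=1.0):
    """Generic Continuous Integrate-and-Fire over one utterance (illustrative).

    frames: list of frame vectors (lists of floats), in time order
    alphas: one non-negative weight per frame
    Returns one integrated vector per firing, in time order.
    """
    dim = len(frames[0])
    tokens = []
    acc_weight = 0.0        # weight integrated since the last firing
    acc_vec = [0.0] * dim   # weighted sum of frames since the last firing
    for h, a in zip(frames, alphas):
        # Fire while the accumulated weight reaches the threshold; a frame
        # with a large weight is split across the token boundary.
        while acc_weight + a >= threshold:
            used = threshold - acc_weight  # share that completes this token
            tokens.append([v + used * x for v, x in zip(acc_vec, h)])
            a -= used
            acc_weight = 0.0
            acc_vec = [0.0] * dim
        # Whatever is left of this frame starts or extends the next token.
        acc_weight += a
        acc_vec = [v + a * x for v, x in zip(acc_vec, h)]
    return tokens


# Toy example: four 1-dimensional frames whose weights sum to 2,
# so exactly two token-level vectors are emitted.
tokens = cif([[1.0], [3.0], [2.0], [4.0]], [0.5, 0.5, 0.25, 0.75])
print(tokens)
```

Because the weights sum to roughly the number of tokens, the count of emitted vectors tracks the token count without any timestamp supervision, which is what makes this mechanism attractive for weakly supervised localization.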
Findings
Significantly improves hotword retrieval accuracy.
Reduces CER and B-WER compared to baselines.
Enhances recognition of named entities in speech.
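For readers unfamiliar with the metric, the following sketch shows the standard way character error rate (CER) is computed: the character-level edit distance between reference and hypothesis, divided by the reference length. The function names are illustrative, and this summary does not describe the paper's scoring setup. B-WER is not defined here either; it usually denotes a word error rate restricted to words in the biasing list, and it is not computed in this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters
```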
Abstract
Speech LLM-based ASR often struggles with named entities and long-tail words due to strong internal language-model priors. Retrieval-augmented biasing can help, but its effectiveness depends on accurate hotword localization in full-utterance speech under weak supervision. We propose CLAR, a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. With length-aware localized matching, CLAR anchors short-entity acoustic cues and reduces representation dilution and attention drift. The retriever is trained with a multi-granularity objective combining global and local segment-level contrastive losses and a CIF quantity constraint. At inference, top-ranked hotwords are injected as contextual prompts for the Speech LLM, improving recognition without shallow fusion. Experiments show that CLAR significantly…
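The training objective described above can be sketched as a weighted sum of three terms. This is a hedged illustration, not the paper's formulation: the one-directional InfoNCE form, the temperature, the weights `w_local` and `w_qty`, and all function names are assumptions, since the abstract names the loss components but gives no formulas. The quantity term follows the usual CIF convention of penalizing the gap between the summed frame weights and the target token count.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(speech_embs, text_embs, temperature=0.1):
    """Contrastive loss where speech_embs[i] should match text_embs[i].

    Every other text embedding in the batch acts as a negative.
    """
    total = 0.0
    for i, s in enumerate(speech_embs):
        logits = [dot(s, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy against the positive
    return total / len(speech_embs)


def quantity_loss(alphas, num_tokens):
    """Absolute gap between summed CIF weights and the target token count."""
    return abs(sum(alphas) - num_tokens)


def multi_granularity_loss(global_speech, global_text, local_speech, local_text,
                           alphas, num_tokens, w_local=1.0, w_qty=1.0):
    """Global + local segment-level contrastive terms plus the quantity term.

    The weights are hypothetical; the summary does not give their values.
    """
    return (info_nce(global_speech, global_text)
            + w_local * info_nce(local_speech, local_text)
            + w_qty * quantity_loss(alphas, num_tokens))


# Toy batch of two utterance/hotword pairs with 2-dimensional embeddings.
embs = [[1.0, 0.0], [0.0, 1.0]]
print(multi_granularity_loss(embs, embs, embs, embs,
                             alphas=[0.5, 0.5, 0.25, 0.75], num_tokens=2))
```

In this reading, the global term compares utterance-level speech embeddings with hotword text embeddings, while the local term compares CIF-localized segment embeddings with the same text side, so that short entities are not diluted by the rest of the utterance.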
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
