Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition
Yiming Rong, Yixin Zhang, Ziyi Wang, Deyang Jiang, Yunlong Zhao, Haoran Wu, Shiyu Zhou, Bo Xu

TL;DR
This paper introduces SAP$^{2}$, a novel speech-aware framework that dynamically prunes and integrates relevant context in ASR, significantly improving recognition accuracy in long-context scenarios like conferences.
Contribution
The paper proposes a two-stage framework that uses speech-driven attention-based pooling to prune and integrate long context in ASR, achieving state-of-the-art results on SlideSpeech and LibriSpeech.
Findings
Achieves 7.71% WER on SlideSpeech and 1.12% on LibriSpeech.
Reduces biased keyword error rates by 41.1% on SlideSpeech.
Maintains robust performance with extensive contextual input.
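For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how a word error rate is conventionally computed (word-level edit distance divided by reference length) and how a relative reduction would be derived. The function names and the example numbers are illustrative assumptions, not taken from the paper, and the summary does not state whether the 41.1% figure is a relative or an absolute reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

# Hypothetical error rates, chosen only to illustrate the arithmetic.
print(relative_reduction(10.0, 5.89))
```

A biased keyword error rate would apply the same counting restricted to words in the biasing list; the exact scoring protocol used in the paper is not described in this summary.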
Abstract
Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily from constrained model context windows and the sparsity of relevant information within extensive contextual noise. To address this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets, achieving word error rates…
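The abstract does not spell out how Speech-Driven Attention-based Pooling works, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes that a mean-pooled speech embedding acts as the attention query, that each keyword's token embeddings are compressed into one vector by softmax-weighted averaging, and that pruning keeps the top-k keywords by similarity to the speech query. All function names, the toy 2-dimensional embeddings, and the top-k rule are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Average a list of equal-length vectors element-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def speech_driven_pool(speech_frames, token_embeddings):
    """Compress token embeddings into one vector, weighting tokens by
    scaled dot-product similarity to the mean speech frame (assumed query)."""
    query = mean_vector(speech_frames)
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in token_embeddings])
    pooled = [sum(w * tok[i] for w, tok in zip(weights, token_embeddings))
              for i in range(len(query))]
    return pooled, weights


def prune_keywords(speech_frames, keywords, top_k):
    """Keep the top_k keywords whose pooled embedding best matches the speech query."""
    query = mean_vector(speech_frames)
    scored = []
    for name, tokens in keywords.items():
        pooled, _ = speech_driven_pool(speech_frames, tokens)
        scored.append((dot(query, pooled), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy embeddings, invented for illustration only.
frames = [[1.0, 0.0], [0.8, 0.2]]
candidates = {
    "transformer": [[1.0, 0.0], [0.9, 0.1]],
    "banana": [[0.0, 1.0]],
    "attention": [[0.7, 0.3]],
}
print(prune_keywords(frames, candidates, top_k=2))  # ['transformer', 'attention']
```

In a real system the query, key, and value projections would be learned and the speech side would keep frame-level detail. The sketch only shows how speech-conditioned weights can both compress a context sequence and rank candidates for pruning.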
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
