Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

Yiming Rong; Yixin Zhang; Ziyi Wang; Deyang Jiang; Yunlong Zhao; Haoran Wu; Shiyu Zhou; Bo Xu

arXiv:2511.11139·cs.CL·January 26, 2026

TL;DR

This paper introduces SAP$^{2}$, a speech-aware framework that dynamically prunes and integrates relevant contextual keywords for ASR, improving recognition accuracy in long-context scenarios such as conference presentations.

Contribution

The paper proposes a two-stage framework built on Speech-Driven Attention-based Pooling for long-context integration in ASR, reporting state-of-the-art results on SlideSpeech and LibriSpeech.

Findings

01 Achieves 7.71% WER on SlideSpeech and 1.12% on LibriSpeech.

02 Reduces biased keyword error rates by 41.1% on SlideSpeech.

03 Maintains robust performance with extensive contextual input.
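
For readers unfamiliar with the metric, the WER figures above follow the usual definition: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition only. It assumes whitespace tokenization and no text normalization, and the function name `word_error_rate` is hypothetical; the paper's actual scoring pipeline is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.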

Abstract

Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets, achieving word error rates…

Code & Models

Datasets

jymh/SAP2-ASR · dataset · 65 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing