Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems

Mingyu Cui; Mengzhe Geng; Jiajun Deng; Chengxi Deng; Jiawen Kang; Shujie Hu; Guinan Li; Tianzi Wang; Zhaoqing Li; Xie Chen; Xunying Liu

arXiv:2508.10456 · eess.AS · August 15, 2025

TL;DR

This paper explores four methods for modeling cross-utterance speech contexts in streaming and non-streaming Conformer-Transducer ASR systems, reporting improved recognition accuracy on four benchmark datasets across three languages (English, Mandarin, and Cantonese).

Contribution

It introduces a novel chunk-based approach for cross-utterance modeling and an efficient batch-training scheme, advancing contextual speech recognition techniques.

Findings

1. Consistent WER/CER reductions up to 0.9%/1.1% across datasets.

2. The proposed methods outperform baseline models without cross-utterance contexts.

3. Performance rivals leading models like Wav2vec2.0-Conformer and Whisper.

Abstract

This paper investigates four approaches to modeling cross-utterance speech contexts in streaming and non-streaming Conformer-Transducer (C-T) ASR systems: i) input audio feature concatenation; ii) cross-utterance Encoder embedding concatenation; iii) cross-utterance Encoder embedding pooling projection; and iv) a novel chunk-based approach, applied to C-T models for the first time. An efficient batch-training scheme is proposed for contextual C-Ts that uses spliced speech utterances within each minibatch to minimize the synchronization overhead while preserving the sequential order of cross-utterance speech contexts. Experiments are conducted on four benchmark speech datasets across three languages: the English GigaSpeech and Mandarin WenetSpeech corpora, used to pre-train the contextual C-T models; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets…
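To make the first three context-fusion options in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of mean pooling, the fixed projection matrix, and the choice to append the pooled context vector to every current frame are all hypothetical. The chunk-based approach (iv) and the batch-training scheme are not sketched because the abstract does not give enough detail.

```python
from typing import List

# A sequence of frames; each frame is a feature or embedding vector.
Frames = List[List[float]]


def concat_context(prev: Frames, curr: Frames) -> Frames:
    """Options i/ii (sketch): prepend the previous utterance's frames
    (raw features or Encoder embeddings) along the time axis."""
    return prev + curr


def mean_pool(frames: Frames) -> List[float]:
    """Average over time to obtain one fixed-size vector.
    Mean pooling is an assumption made for this illustration."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def pooling_projection(prev_emb: Frames, weight: List[List[float]]) -> List[float]:
    """Option iii (sketch): pool the previous utterance's embeddings,
    then apply a linear projection. `weight` has shape (out_dim, in_dim)
    and would be learned in a real model; here it is supplied by hand."""
    pooled = mean_pool(prev_emb)
    return [sum(w * x for w, x in zip(row, pooled)) for row in weight]


def append_context_vector(curr: Frames, ctx: List[float]) -> Frames:
    """Hypothetical fusion step: attach the compact context vector
    to every frame of the current utterance."""
    return [frame + ctx for frame in curr]


prev_utt = [[1.0, 2.0], [3.0, 4.0]]                  # 2 frames, dimension 2
curr_utt = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]     # 3 frames, dimension 2
identity = [[1.0, 0.0], [0.0, 1.0]]                  # stand-in for a learned projection

longer = concat_context(prev_utt, curr_utt)          # 5 frames, dimension 2
ctx = pooling_projection(prev_utt, identity)         # one vector of dimension 2
augmented = append_context_vector(curr_utt, ctx)     # 3 frames, dimension 4
```

The trade-off the sketch exposes is one of sequence length: concatenation grows the input by the full length of the preceding utterance, whereas pooling projection compresses that context into a single fixed-size vector regardless of how long the preceding utterance was.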
