From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Xiaoyong Guo; Nanjie Li; Zijie Zeng; Kai Wang; Hao Huang; Haihua Xu; Wei Shi

arXiv:2603.24034·cs.CL·March 26, 2026

TL;DR

This paper introduces a unified training framework for Speech-LLMs that mitigates contextual exposure bias, the mismatch between the oracle conversation history seen in training and the error-prone history available at inference, improving the robustness and accuracy of contextual speech recognition.

Contribution

The paper proposes a training approach that combines Teacher Error Knowledge (Whisper large-v3 hypotheses used as training-time history), Context Dropout, and Direct Preference Optimization (DPO) on curated failure cases to address contextual exposure bias in Speech-LLMs.
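The first two components can be illustrated with a minimal sketch of how a training-time history might be assembled. This is not the authors' implementation: the function name `build_history`, the `dropout_prob` value, and the choice to drop the whole history at once (rather than individual utterances) are assumptions made for illustration.

```python
import random
from typing import List, Optional


def build_history(
    teacher_hyps: List[str],
    index: int,
    history_size: int = 2,
    dropout_prob: float = 0.3,  # hypothetical value, not taken from the paper
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the context history for the utterance at position `index`.

    teacher_hyps holds teacher-model transcripts (e.g. ASR hypotheses) of the
    utterances in one talk, in order. They stand in for oracle transcripts so
    that training sees the same kind of errors as inference.
    """
    rng = rng or random.Random()

    # Context Dropout (assumed variant): with probability dropout_prob,
    # hide the history entirely so the model cannot over-rely on it.
    if rng.random() < dropout_prob:
        return []

    # Otherwise use up to `history_size` preceding teacher hypotheses.
    start = max(0, index - history_size)
    return teacher_hyps[start:index]
```

With `history_size=2`, the third utterance of a talk would receive the teacher hypotheses of the first two as context, unless dropout removes them for that training step.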

Findings

01 Reduced WER from 5.59% to 5.17% on TED-LIUM 3, using a two-utterance history under predicted-history decoding.

02 Improved robustness under irrelevant-context attacks.

03 Consistent gains in out-of-domain zero-shot speech recognition.
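To put the headline WER numbers in perspective, the absolute drop from 5.59% to 5.17% can be converted into a relative reduction. The helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


# WER on TED-LIUM 3: 5.59% baseline, 5.17% with the full framework.
print(round(relative_reduction(5.59, 5.17), 1))  # 7.5
```

The absolute gain of 0.42 points therefore corresponds to roughly a 7.5% relative WER reduction.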

Abstract

Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge, which uses Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduces WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves it to 5.17%. Under irrelevant-context…
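For readers unfamiliar with DPO, the standard per-pair objective can be sketched as follows. The abstract does not say how preference pairs are built from the curated failure cases, so treating the preferred transcript as "chosen" and a failure hypothesis as "rejected" is an assumption, as are the function name and the `beta` default.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair, given sequence log-probs.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the frozen reference model agree on both sequences, the margin is zero and the loss equals log 2; the loss falls below that as the policy raises the chosen transcript's likelihood relative to the rejected one.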

Taxonomy

Topics: Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning · Speech and Audio Processing