From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs
Xiaoyong Guo, Nanjie Li, Zijie Zeng, Kai Wang, Hao Huang, Haihua Xu, Wei Shi

TL;DR
This paper introduces a unified training framework for Speech-LLMs that mitigates contextual exposure bias, the mismatch between the oracle conversation history seen in training and the error-prone history used at inference, improving the robustness and accuracy of contextual speech recognition.
Contribution
It proposes a novel training approach combining Teacher Error Knowledge, Context Dropout, and Direct Preference Optimization to address contextual exposure bias in Speech-LLMs.
Findings
Reduced WER on TED-LIUM 3 from 5.59% (oracle-history training) to 5.17% (after DPO), with a two-utterance history as context.
Improved robustness under irrelevant-context attacks.
Consistent gains in out-of-domain zero-shot speech recognition on LibriSpeech.
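As a quick check on the headline number, the sketch below converts the reported absolute WER change (5.59% to 5.17%) into a relative reduction, which works out to roughly 7.5%. The function name `relative_wer_reduction` is illustrative and does not come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, given two WER values in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# WER values reported in the findings above (TED-LIUM 3).
reduction = relative_wer_reduction(5.59, 5.17)
print(f"Relative WER reduction: {reduction:.1f}%")
```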
Abstract
Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduces WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves to 5.17%. Under irrelevant-context…
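The first two components of the framework can be pictured with a minimal sketch, given here under stated assumptions rather than as the authors' implementation. It assumes the training-time history is simply the previous teacher (ASR) hypotheses, and that Context Dropout removes the whole history with some probability. The abstract does not specify the dropout granularity or rate, so `build_training_context`, `dropout_prob`, and its default are hypothetical. The second function is the standard per-example DPO loss from the general DPO formulation, not a detail confirmed by this summary.

```python
import math
import random
from typing import List, Optional


def build_training_context(
    teacher_hyps: List[str],
    index: int,
    history_size: int = 2,
    dropout_prob: float = 0.3,  # hypothetical rate, not taken from the paper
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the history shown to the model for utterance `index`.

    The history is taken from teacher hypotheses (noisy ASR output) instead
    of reference transcripts, and is dropped entirely with `dropout_prob`.
    """
    rng = rng or random.Random()
    if rng.random() < dropout_prob:
        return []  # context dropout: train this example without history
    start = max(0, index - history_size)
    return teacher_hyps[start:index]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value for illustration
) -> float:
    """Standard per-example DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative usage with made-up hypotheses.
hyps = ["so today we talk about", "speech models an context", "it is error prone"]
print(build_training_context(hyps, index=2, dropout_prob=0.0))
```

With `dropout_prob=0.0` the call returns the two preceding hypotheses, errors included, which is the kind of imperfect history the model would see under predicted-history decoding.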
Taxonomy
Topics: Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning · Speech and Audio Processing
