Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards
Linghan Fang, Tianxin Xie, Li Liu

TL;DR
This paper introduces ASR-TRA, a test-time reinforcement learning framework that improves speech recognition robustness by using semantic rewards from audio-text alignment, without ground-truth labels.
Contribution
The paper proposes a novel test-time adaptation method for ASR that leverages reinforcement learning with semantic rewards, addressing confirmation bias in existing approaches.
Findings
Achieves higher accuracy on noisy and accented speech datasets.
Maintains lower latency compared to existing TTA methods.
Enhances stability and interpretability of ASR adaptation.
Abstract
Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential for improving model adaptability at inference time without ground-truth labels. Existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes…
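The confirmation-bias argument in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the three "candidate transcripts", their logits, and the reward values are hypothetical, and the reward vector merely stands in for an audio-text semantic score. The sketch also uses the exact expected policy gradient over three candidates instead of sampled rollouts. It compares an entropy-minimization update, which can only sharpen the current top guess, with a reward-driven update, which can move probability toward a different candidate.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_grad(logits):
    """Gradient of the softmax entropy H(p) with respect to the logits."""
    p = softmax(logits)
    h = -sum(pi * math.log(pi) for pi in p)
    # dH/dz_k = -p_k * (log p_k + H)
    return [-pk * (math.log(pk) + h) for pk in p]

def expected_reward_grad(logits, rewards):
    """Gradient of the expected reward J = sum_k p_k * r_k w.r.t. the logits."""
    p = softmax(logits)
    baseline = sum(pk * rk for pk, rk in zip(p, rewards))
    # dJ/dz_k = p_k * (r_k - J): policy gradient with the mean reward as baseline
    return [pk * (rk - baseline) for pk, rk in zip(p, rewards)]

def adapt(logits, grad_fn, lr=1.0, steps=50, ascend=True):
    """Plain gradient ascent (or descent) on the logits; returns final probabilities."""
    z = list(logits)
    sign = 1.0 if ascend else -1.0
    for _ in range(steps):
        g = grad_fn(z)
        z = [zi + sign * lr * gi for zi, gi in zip(z, g)]
    return softmax(z)

# Hypothetical setup: the model prefers candidate 0, but candidate 1 is the
# one that an external (assumed) audio-text score rates highest.
init_logits = [2.0, 1.0, 0.0]
rewards = [0.2, 0.9, 0.1]  # assumed values, for illustration only

p_init = softmax(init_logits)
# Entropy minimization: descend on H, using model confidence as the only signal.
p_ent = adapt(init_logits, entropy_grad, ascend=False)
# Reward-driven adaptation: ascend on the expected external reward.
p_rl = adapt(init_logits, lambda z: expected_reward_grad(z, rewards), ascend=True)

print("initial        :", [round(x, 3) for x in p_init])
print("entropy-min    :", [round(x, 3) for x in p_ent])
print("reward-driven  :", [round(x, 3) for x in p_rl])
```

In this toy setting the entropy objective pushes more mass onto candidate 0 because it never consults anything outside the model, while the reward objective shifts mass to candidate 1. How ASR-TRA actually computes its reward and updates its decoder prompt is described in the full paper, not here.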
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Machine Learning and Data Classification
