Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards

Linghan Fang; Tianxin Xie; Li Liu

arXiv:2603.05231 · cs.SD · March 6, 2026

Open Access

TL;DR

This paper introduces ASR-TRA, a test-time reinforcement learning framework that improves speech recognition robustness by using semantic rewards from audio-text alignment, without ground-truth labels.

Contribution

The paper proposes a novel test-time adaptation method for ASR that leverages reinforcement learning with semantic rewards, addressing confirmation bias in existing approaches.

Findings

01. Achieves higher accuracy on noisy and accented speech datasets.

02. Maintains lower latency compared to existing TTA methods.

03. Enhances stability and interpretability of ASR adaptation.

Abstract

Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential for improving model adaptability at inference time without ground-truth labels. Existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes…
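
The confirmation-bias concern raised in the abstract can be illustrated with a toy example. The sketch below is not the paper's method and does not involve an ASR model; it applies plain gradient descent on the entropy of a single softmax distribution over three hypothetical tokens. The logits, the learning rate, the step count, and the choice of which token is "correct" are all assumptions made for illustration. It shows that entropy minimization sharpens whatever the model already prefers, so a confident mistake becomes more confident.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_grad(logits):
    """Gradient of the softmax entropy H with respect to the logits.

    For p = softmax(z): dH/dz_i = -p_i * (log p_i + H).
    """
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]

def minimize_entropy(logits, lr=1.0, steps=50):
    """Plain gradient descent on entropy (lr and steps are illustrative)."""
    z = list(logits)
    for _ in range(steps):
        g = entropy_grad(z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z

# Hypothetical setup: the model prefers token 0, but token 1 is the
# (unknown at test time) correct token.
initial_logits = [2.0, 1.5, 0.0]
correct_index = 1

before = softmax(initial_logits)
after = softmax(minimize_entropy(initial_logits))

print("before:", [round(p, 3) for p in before])
print("after: ", [round(p, 3) for p in after])
```

In this toy setting the update never consults any signal about which token is right, so probability mass moves toward the initially preferred (wrong) token and away from the correct one. A reward computed from something other than the model's own confidence, as the paper's TL;DR describes, is one way to break that loop.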

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Machine Learning and Data Classification