Contrastive Regularization for Accent-Robust ASR
Van-Phat Thai, Aradhya Dhruv, Duc-Thinh Pham, Sameer Alam

TL;DR
This paper proposes supervised contrastive learning as a simple, effective regularization method to improve accent robustness in speech recognition systems, reducing error rates on unseen accents.
Contribution
It introduces a contrastive regularization technique for CTC fine-tuning that enhances accent invariance without changing model architecture or requiring explicit accent labels.
Findings
Achieves up to 29% relative WER reduction on unseen accents on the L2-ARCTIC benchmark.
Promotes more compact and stable encoder representations.
Effective across multiple pretrained encoder models.
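A relative WER reduction is measured against the baseline error rate rather than as an absolute drop in percentage points. The sketch below shows the arithmetic. The function name and the two WER values are hypothetical, chosen only for illustration, and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, not taken from the paper:
# a drop from 20.0% to 14.2% WER is 5.8 points absolute, 29% relative.
example = relative_wer_reduction(20.0, 14.2)
print(f"{example:.0%}")  # prints 29%
```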
Abstract
ASR systems based on self-supervised acoustic pretraining and CTC fine-tuning achieve strong performance on native speech but remain sensitive to accent variability. We investigate supervised contrastive learning (SupCon) as a lightweight, accent-invariant auxiliary objective for CTC fine-tuning. An utterance-level contrastive loss regularizes encoder representations without architectural modification or explicit accent supervision. Experiments on the L2-ARCTIC benchmark show consistent WER reductions across multiple pretrained encoders, with up to 25-29% relative reduction under unseen-accent evaluation. Analysis using within-transcript cosine dispersion indicates that SupCon promotes more compact and stable representation geometry under accent variability. Overall, SupCon provides an effective and model-agnostic regularization strategy for improving accent robustness.
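The abstract names two quantities that a short sketch can make concrete: an utterance-level supervised contrastive loss and a within-transcript cosine dispersion. The NumPy code below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes each utterance has already been pooled into a single vector, that positives are utterances sharing a group id (for example, the same transcript), and that dispersion is the mean pairwise cosine distance within a group. The function names, the temperature of 0.1, and these definitions are all assumptions not confirmed by the source.

```python
import numpy as np


def _l2_normalize(x: np.ndarray) -> np.ndarray:
    # Small epsilon guards against division by zero for all-zero vectors.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)


def supcon_loss(embeddings: np.ndarray, group_ids: np.ndarray,
                temperature: float = 0.1) -> float:
    """Supervised contrastive loss over (N, D) utterance embeddings, N >= 2.

    Utterances with the same entry in group_ids are treated as positives
    (assumption: the grouping key is the transcript, not the accent).
    """
    z = _l2_normalize(embeddings)
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Scaled cosine similarities; self-pairs are excluded from the softmax.
    sim = np.where(not_self, z @ z.T / temperature, -np.inf)
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    pos = (group_ids[:, None] == group_ids[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    has_pos = n_pos > 0  # anchors without any positive are skipped
    if not has_pos.any():
        return 0.0
    sum_pos = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float((-sum_pos[has_pos] / n_pos[has_pos]).mean())


def within_group_cosine_dispersion(embeddings: np.ndarray,
                                   group_ids: np.ndarray) -> float:
    """Mean pairwise cosine distance among embeddings sharing a group id."""
    z = _l2_normalize(embeddings)
    per_group = []
    for g in np.unique(group_ids):
        members = z[group_ids == g]
        if len(members) < 2:
            continue
        upper = np.triu_indices(len(members), k=1)
        per_group.append(np.mean(1.0 - (members @ members.T)[upper]))
    return float(np.mean(per_group)) if per_group else 0.0


# Toy check: identical vectors within a group versus mutually orthogonal ones.
ids = np.array([0, 0, 1, 1])
compact = np.array([[1.0, 0, 0, 0], [1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 1.0, 0, 0]])
spread = np.eye(4)
print(supcon_loss(compact, ids) < supcon_loss(spread, ids))  # prints True
print(within_group_cosine_dispersion(compact, ids)
      < within_group_cosine_dispersion(spread, ids))  # prints True
```

In a fine-tuning loop, a term like this would typically be added to the CTC loss with a weighting coefficient. The weight, the pooling method, and the batch construction are left open here because the source does not specify them.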
