AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup

Carlos Carvalho; Alberto Abad

arXiv:2410.14910 · eess.AS · August 28, 2025

TL;DR

This paper introduces AC-Mix (agnostic contrastive mixup), a contrastive mixup method for self-supervised domain adaptation in low-resource ASR. It reduces the mismatch between pre-training and fine-tuning data using about 11 hours of adaptation data and 1 hour of training time.

Contribution

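The paper presents a novel contrastive mixup approach for self-supervised domain adaptation in low-resource ASR: the SSL model is further pre-trained on mixed data views that interpolate source-domain and target-domain samples, which improves performance with limited data and compute.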

Findings

01 AC-Mix consistently outperforms the baseline system on low-resource ASR.

02 The method requires only about 11 hours of adaptation data and 1 hour of training time.

03 AC-Mix is effective in reducing the mismatch between pre-training (source) and fine-tuning (target) data in self-supervised speech models.

Abstract

Self-supervised learning (SSL) leverages large amounts of unlabelled data to learn rich speech representations, fostering improvements in automatic speech recognition (ASR), even when only a small amount of labelled data is available for fine-tuning. Despite the advances in SSL, a significant challenge remains when the data used for pre-training (source domain) mismatches the fine-tuning data (target domain). To tackle this domain mismatch challenge, we propose a new domain adaptation method for low-resource ASR focused on contrastive mixup for joint-embedding architectures named AC-Mix (agnostic contrastive mixup). In this approach, the SSL model is adapted through additional pre-training using mixed data views created by interpolating samples from the source and the target domains. Our proposed adaptation method consistently outperforms the baseline system, using approximately 11…
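The abstract describes mixed data views built by interpolating source-domain and target-domain samples. A minimal sketch of that interpolation step follows, assuming the standard mixup formulation (a convex combination with a coefficient drawn from a Beta distribution). The function name `mixup_views`, the `alpha` default, and the cropping to a common length are illustrative assumptions; the excerpt here does not specify the exact mixing scheme or the contrastive objective used in AC-Mix.

```python
import random


def mixup_views(source, target, alpha=0.4, rng=None):
    """Return a mixed view of two waveforms and the mixing coefficient.

    Assumption: standard mixup, mixed = lam * source + (1 - lam) * target,
    with lam ~ Beta(alpha, alpha). This is not confirmed to be the exact
    scheme used by AC-Mix.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    n = min(len(source), len(target))    # crop to common length (assumption)
    mixed = [lam * s + (1.0 - lam) * t for s, t in zip(source[:n], target[:n])]
    return mixed, lam


if __name__ == "__main__":
    # Toy waveforms standing in for one source-domain and one target-domain sample.
    src = [0.5, -0.25, 0.125, 0.0]
    tgt = [0.0, 0.75, -0.5]
    view, lam = mixup_views(src, tgt, rng=random.Random(0))
    print(f"lam={lam:.3f}, mixed length={len(view)}")
```

In an adaptation loop, a view like this would be fed to the SSL model during the additional pre-training stage; how the views are paired for the contrastive loss is left open here because the truncated abstract does not say.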

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Mixup