Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation

Ligong Lei; Wenwen Lu; Xudong Pang; Zaokere Kadeer; Aishan Wumaier

arXiv:2602.13263 · cs.CL · February 17, 2026

Open Access

TL;DR

This paper proposes a multimodal, reference-free data selection method for ASR accent adaptation. It selects reliable pseudo-labeled data using speech-text alignment and predicted WER, reducing the need for labeled data.

Contribution

The paper introduces a multimodal, reference-free data selection pipeline that improves ASR accent adaptation by filtering pseudo-labeled data without requiring reference transcripts.

Findings

01 Achieves near-supervised WER with only 1.5k selected utterances.

02 Effectively handles cross-domain accent shifts.

03 Outperforms random sampling and recent baselines in experiments.

Abstract

Automatic speech recognition (ASR) systems often degrade on accented speech because acoustic-phonetic and prosodic shifts induce a mismatch to training data, making labeled accent adaptation costly. However, common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification when fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. The pipeline starts with a target-aware preselection step based on submodular mutual information to improve query relevance and reduce downstream computation. It then generates multiple pseudo-transcriptions per utterance via perturbation-based decoding and scores each hypothesis using two reference-free…
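
The target-aware preselection step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which submodular mutual information function is used, so the code below assumes one common instance, the facility-location variant (often written FLQMI), maximized greedily. The embeddings, the cosine similarity, the `eta` weight, and the function names `flqmi` and `greedy_preselect` are all hypothetical choices made for illustration, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def flqmi(selected, sim, eta=1.0):
    """Assumed objective: sum_q max_{a in A} sim[a][q] + eta * sum_{a in A} max_q sim[a][q]."""
    if not selected:
        return 0.0
    n_query = len(sim[0])
    # How well the selected set covers each query (target-accent) item.
    coverage = sum(max(sim[a][q] for a in selected) for q in range(n_query))
    # How relevant each selected item is to its closest query item.
    relevance = sum(max(sim[a]) for a in selected)
    return coverage + eta * relevance


def greedy_preselect(pool, query, budget, eta=1.0):
    """Greedily pick up to `budget` pool indices with the largest marginal gain."""
    # Negative similarities are clipped to zero so the objective stays monotone.
    sim = [[max(cosine(p, q), 0.0) for q in query] for p in pool]
    selected = []
    remaining = set(range(len(pool)))
    while remaining and len(selected) < budget:
        base = flqmi(selected, sim, eta)
        best = max(
            sorted(remaining),
            key=lambda i: flqmi(selected + [i], sim, eta) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D "embeddings": four unlabeled pool utterances and two target-accent queries.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [[1.0, 0.0], [0.0, 1.0]]
chosen = greedy_preselect(pool, query, budget=2)
```

In this toy setup the greedy loop favors pool items that are close to some query item while still covering both queries, which is the sense in which such a preselection is query-relevant. Recomputing the objective for every candidate is quadratic and only suitable for small pools; a practical implementation would cache the per-query maxima.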

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing