Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition

Shunsuke Mitsumori; Sara Kashiwagi; Keitaro Tanaka; Shigeo Morishima

arXiv:2506.22194 · eess.AS · June 30, 2025 · Interspeech

TL;DR

This paper introduces clip-wise acoustic token distribution similarity (CATDS), a fine-grained method that improves low-resource ASR by selecting acoustically relevant donor speech clips, and reports that it outperforms traditional selection techniques.

Contribution

The paper proposes a novel clip-level acoustic similarity measure aligned with SSL model representations, enhancing donor data selection for low-resource ASR.

Findings

1. CATDS outperforms traditional selection methods.
2. CATDS enables effective use of donor languages that were previously detrimental.
3. CATDS improves ASR accuracy in low-resource settings.

Abstract

This paper presents a novel donor data selection method to enhance low-resource automatic speech recognition (ASR). While ASR performs well in high-resource languages, its accuracy declines in low-resource settings due to limited training data. A common solution is to leverage multilingual self-supervised learning (SSL) models with donor languages. However, existing methods rely on language-level similarity, overlooking clip-level variations. To address this limitation, we propose clip-wise acoustic token distribution similarity (CATDS), a fine-grained selection method that identifies acoustically relevant donor clips for better alignment with the target language. Unlike existing clip-level selection methods, our method aligns with the representation of SSL models and offers more challenging yet valuable samples. Experimental results show that CATDS outperforms traditional selection…
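
The abstract does not give the CATDS formula, so the following is only a minimal sketch of the general idea of clip-level selection by acoustic token distribution. It assumes each clip has already been converted to discrete acoustic token IDs (for example by clustering SSL features, which is an assumption here), and it assumes cosine similarity between token frequency vectors as the similarity measure. The function names, `vocab_size`, `top_k`, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def token_distribution(tokens, vocab_size):
    """Return the normalized frequency vector of discrete acoustic token IDs."""
    total = len(tokens)
    if total == 0:
        return [0.0] * vocab_size
    counts = Counter(tokens)
    return [counts.get(i, 0) / total for i in range(vocab_size)]


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def select_donor_clips(target_tokens, donor_clips, vocab_size, top_k):
    """Score each donor clip against the target-language token distribution
    and return the top_k (clip_id, score) pairs, highest score first.

    Assumption: similarity is cosine similarity of token frequency vectors;
    the paper's actual measure may differ.
    """
    target_dist = token_distribution(target_tokens, vocab_size)
    scored = []
    for clip_id, tokens in donor_clips.items():
        clip_dist = token_distribution(tokens, vocab_size)
        scored.append((clip_id, cosine_similarity(clip_dist, target_dist)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy example with a hypothetical 4-token vocabulary.
target_tokens = [0, 0, 1, 2, 0, 1]          # pooled tokens from target-language speech
donor_clips = {
    "clip_a": [0, 0, 1, 2],                 # token usage close to the target
    "clip_b": [3, 3, 3, 2],                 # mostly a token the target never uses
    "clip_c": [0, 1, 1, 2],
}
selected = select_donor_clips(target_tokens, donor_clips, vocab_size=4, top_k=2)
for clip_id, score in selected:
    print(f"{clip_id}: {score:.3f}")
```

Scoring individual clips rather than whole languages is what lets such a scheme keep useful clips from a donor language whose average similarity to the target is low, which matches the motivation stated in the abstract.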

Taxonomy

Topics: Speech Recognition and Synthesis · ICT in Developing Communities · Speech and Audio Processing