Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset

Farhan Samir; Emily P. Ahn; Shreya Prakash; Márton Soskuthy; Vered Shwartz; Jian Zhu

arXiv:2410.04292·cs.CL·October 8, 2024

TL;DR

This paper presents a statistical test to identify unreliable language subsets in large multilingual audio datasets, improving downstream phonetic transcription accuracy by filtering out low-quality data.

Contribution

Introduces the Preference Proportion Test, a statistical test for detecting unreliable language subsets from as few as 20 annotated samples per subset.

Findings

01 Identified systematic transcription errors in 10 language subsets of X-IPAPack

02 Filtering out low-quality data before training brought substantial gains in downstream phonetic transcription, most notably a 25.7% relative improvement

03 Method enables scalable, reliable multilingual dataset auditing

Abstract

Curating datasets that span multiple languages is challenging. To make the collection more scalable, researchers often incorporate one or more imperfect classifiers in the process, like language identification models. These models, however, are prone to failure, resulting in some language subsets being unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable subsets. By annotating only 20 samples for a language subset, we're able to identify systematic transcription errors for 10 language subsets in a recent large multilingual transcribed audio dataset, X-IPAPack (Zhu et al., 2024). We find that filtering this low-quality data out when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in…
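The abstract does not spell out the form of the Preference Proportion Test, so the following is only a minimal sketch of one way such a test could work, not the authors' implementation. It assumes an annotator compares the dataset's transcription against an alternative transcription for each of 20 samples, and that a subset is flagged when the alternative is preferred significantly more often than chance under a one-sided exact binomial test. The function names, the null proportion `p0`, and the significance level `alpha` are all hypothetical.

```python
from math import comb


def binomial_tail(k: int, n: int, p0: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p0), computed exactly."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def flag_subset(prefers_alternative: int, n: int = 20,
                p0: float = 0.5, alpha: float = 0.05) -> bool:
    """Flag a language subset as unreliable (assumed decision rule).

    prefers_alternative: number of annotated samples (out of n) where the
    annotator preferred the alternative transcription over the dataset's.
    p0 and alpha are illustrative defaults, not values from the paper.
    """
    p_value = binomial_tail(prefers_alternative, n, p0)
    return p_value < alpha


if __name__ == "__main__":
    # Hypothetical annotation outcomes for a 20-sample audit.
    for k in (10, 14, 15, 20):
        print(k, round(binomial_tail(k, 20, 0.5), 4), flag_subset(k))
```

Under these assumed defaults, the flagging threshold is determined entirely by the sample size and `alpha`, which is why a small fixed annotation budget per subset can still support a decision.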

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing