Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning

Mingfei Lau; Qian Chen; Yeming Fang; Tingting Xu; Tongzhou Chen; Pavel Golik

arXiv:2506.17525 · cs.CL · July 1, 2025

TL;DR

This paper audits three widely used public multilingual speech datasets (Mozilla Common Voice 17.0, FLEURS, and Vox Populi) and finds significant quality issues in some languages, especially under-resourced ones. It argues that sociolinguistic awareness and proactive language planning are needed to improve dataset quality.

Contribution

The paper identifies key quality issues in multilingual speech datasets, divides them into micro-level and macro-level categories, and proposes guidelines for future data collection that emphasize sociolinguistic awareness and proactive language planning.

Findings

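01 Macro-level issues are more prevalent in less institutionalized, often under-resourced languages.

02 Proactive language planning (e.g. orthography prescriptions, dialect boundary definition) can improve dataset quality.

03 The paper proposes guidelines and recommendations for mitigating these quality issues in future dataset development.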

Abstract

Our quality audit for three widely used public multilingual speech datasets - Mozilla Common Voice 17.0, FLEURS, and Vox Populi - shows that in some languages, these datasets suffer from significant quality issues, which may obfuscate downstream evaluation results while creating an illusion of success. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the dataset creation process. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of…
