TL;DR
This paper audits three major multilingual speech datasets (Mozilla Common Voice 17.0, FLEURS, and Vox Populi), finds significant quality issues, especially in under-resourced languages, and argues that sociolinguistic awareness and proactive language planning are needed to improve dataset quality.
Contribution
It identifies key quality issues in multilingual speech datasets and proposes guidelines emphasizing sociolinguistic awareness and proactive language planning to improve future data collection.
Findings
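Macro-level quality issues are more prevalent in less institutionalized, often under-resourced languages.
A case analysis of Taiwanese Southern Min (nan_tw) highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control.
Guidelines and recommendations are proposed to mitigate these issues in future dataset development.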
Abstract
Our quality audit for three widely used public multilingual speech datasets - Mozilla Common Voice 17.0, FLEURS, and Vox Populi - shows that in some languages, these datasets suffer from significant quality issues, which may obfuscate downstream evaluation results while creating an illusion of success. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the dataset creation process. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of…