MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China
Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng

TL;DR
This paper introduces MC$^2$, which the authors describe as the largest open-source multilingual corpus of minority languages in China to date. It addresses the scarcity of pre-training data for these underrepresented languages through a quality-centric, culturally aware collection process.
Contribution
The creation of MC$^2$, a comprehensive, high-quality corpus for Tibetan, Uyghur, Kazakh, and Mongolian, including scripts often neglected in previous efforts, with a focus on cultural and linguistic diversity.
Findings
MC$^2$ enables better NLP performance on minority languages.
The corpus includes underrepresented scripts like Kazakh Arabic and traditional Mongolian.
Public release of MC$^2$ and related models supports future research.
Abstract
Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and Mongolian, i.e., Kazakh Arabic script and traditional Mongolian script, respectively, which have been long neglected in previous corpus construction efforts. Recognizing the prevalence of language contamination within existing corpora, we adopt a quality-centric solution for collecting MC$^2$, prioritizing accuracy while enhancing diversity.…
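To make the idea of language contamination concrete, here is a minimal sketch of a script-based filter. It is not the authors' pipeline: the function names, the 0.9 threshold, and the choice of Unicode ranges are assumptions made for illustration. It only measures what fraction of a text's letters fall in the expected script's Unicode blocks.

```python
# Illustrative sketch only; NOT the MC^2 collection pipeline.
# All names and the threshold below are hypothetical.

# Unicode blocks for the scripts mentioned in the abstract.
SCRIPT_RANGES = {
    "tibetan": [(0x0F00, 0x0FFF)],
    "mongolian": [(0x1800, 0x18AF)],  # traditional Mongolian script
    "arabic": [  # used by both Uyghur and Kazakh Arabic script
        (0x0600, 0x06FF),
        (0x0750, 0x077F),
        (0xFB50, 0xFDFF),
        (0xFE70, 0xFEFF),
    ],
}


def script_ratio(text: str, script: str) -> float:
    """Return the fraction of alphabetic characters that belong to `script`."""
    ranges = SCRIPT_RANGES[script]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters if any(lo <= ord(ch) <= hi for lo, hi in ranges)
    )
    return hits / len(letters)


def is_contaminated(text: str, script: str, threshold: float = 0.9) -> bool:
    """Flag text whose share of in-script letters falls below `threshold`."""
    return script_ratio(text, script) < threshold
```

A check like this cannot separate Uyghur from Kazakh written in Arabic script, because both share the same Unicode blocks. Telling them apart would need a language-level signal on top of the script check.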
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Focus
