No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem
Dasol Choi, Woomyoung Park, Youngsook Song

TL;DR
This study analyzes over 3,300 datasets in the Hugging Face ecosystem to understand how cultural, institutional, and research factors influence dataset creation and quality for Chinese, Japanese, and Korean NLP communities, aiming to improve resource sharing and LLM development.
Contribution
It provides a cross-linguistic comparison of CJK datasets, revealing distinct creation patterns and offering practical strategies for enhancing dataset quality and collaboration.
Findings
Chinese datasets are large-scale and institution-driven
Korean NLP datasets are grassroots and community-led
Japanese datasets focus on entertainment and subculture
Abstract
Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages, particularly Chinese, Japanese, and Korean (CJK), remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and…
Taxonomy
Topics: Multilingual Education and Policy · China's Ethnic Minorities and Relations · Linguistic Variation and Morphology
