How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China's LLMs
Andrea W Wen-Yi, Unso Eun Seo Jo, Lu Jia Lin, David Mimno

TL;DR
This paper investigates the apparent lack of explicit language policy in Chinese LLMs. Although China regulates language use, the six open-source models evaluated show no clear focus on language diversity: their performance across 18 languages is indistinguishable from that of international LLMs.
Contribution
The study provides an empirical analysis of Chinese LLMs' language coverage and highlights the absence of a coherent language policy guiding their development.
Findings
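Chinese LLMs' performance on diverse languages is indistinguishable from that of international LLMs.
The models' technical reports consider pretraining data language coverage only for English and Mandarin Chinese.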
No evidence of deliberate language diversity policy in Chinese LLMs.
Abstract
Contemporary language models are increasingly multilingual, but Chinese LLM developers must navigate complex political and business considerations of language diversity. Language policy in China aims at influencing the public discourse and governing a multi-ethnic society, and has gradually transitioned from a pluralist to a more assimilationist approach since 1949. We explore the impact of these influences on current language technology. We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages, spanning a wide range of Chinese, Asian, and Anglo-European languages. Our experiments show that Chinese LLMs' performance on diverse languages is indistinguishable from that of international LLMs. Similarly, the models' technical reports show a lack of consideration for pretraining data language coverage except for English and Mandarin Chinese. Examining Chinese AI…
Taxonomy
Topics: Comparative and International Law Studies
