Rank-frequency relation for Chinese characters

W.B. Deng; A.E. Allahverdyan; B. Li; Q.A. Wang

arXiv:1309.1536·cs.CL·March 10, 2014


TL;DR

This paper investigates the rank-frequency distribution of Chinese characters, revealing a Zipfian law for short texts and a hierarchical structure combining Zipfian and exponential regimes for longer texts, paralleling word distributions in English.

Contribution

It provides a detailed analysis of Chinese character frequency distributions, including theoretical models for different text lengths, and compares these patterns to English word distributions.

Findings

01 Zipf's law holds for short Chinese texts.

02 Long texts show a two-layer rank-frequency structure.

03 Chinese characters serve a similar role to English words.

Abstract

We show that Zipf's law for Chinese characters holds perfectly for sufficiently short texts (a few thousand different characters). The scenario of its validity is similar to that of Zipf's law for words in short English texts. For long Chinese texts (or for mixtures of short Chinese texts), rank-frequency relations for Chinese characters display a two-layer, hierarchic structure that combines a Zipfian power-law regime for frequent characters (first layer) with an exponential-like regime for less frequent characters (second layer). For these two layers we provide different (though related) theoretical descriptions that include the range of low-frequency characters (hapax legomena). The comparative analysis of rank-frequency relations for Chinese characters versus English words illustrates the extent to which the characters play for Chinese writers the same role as the words for those…
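
To make the rank-frequency relation concrete, the following is a minimal Python sketch of how one might tabulate character frequencies by rank and estimate a Zipfian exponent. It is not the authors' procedure: this page does not state their fitting method, so the sketch swaps in a plain least-squares fit of log frequency against log rank, which is a crude estimator. The function names `rank_frequency`, `fit_zipf_exponent`, and `count_hapax` are hypothetical, and restricting the count to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is an assumption.

```python
from collections import Counter
import math


def _is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block is counted.
    return "\u4e00" <= ch <= "\u9fff"


def rank_frequency(text):
    """Return normalized character frequencies, most frequent first (rank 1)."""
    counts = Counter(ch for ch in text if _is_cjk(ch))
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]


def count_hapax(text):
    """Number of characters that occur exactly once (hapax legomena)."""
    counts = Counter(ch for ch in text if _is_cjk(ch))
    return sum(1 for c in counts.values() if c == 1)


def fit_zipf_exponent(freqs, r_min=1, r_max=None):
    """Estimate gamma in f_r ~ r^(-gamma) over ranks r_min..r_max.

    Uses ordinary least squares on (log r, log f_r). Needs at least
    two ranks in the chosen window.
    """
    if r_max is None:
        r_max = len(freqs)
    ranks = range(r_min, r_max + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(freqs[r - 1]) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # slope is -gamma for a decaying power law
```

For a long text, fitting only the top ranks (for example `fit_zipf_exponent(freqs, r_max=k)` for some cutoff `k` chosen by the reader) would isolate the first, power-law layer described in the abstract; the exponential-like second layer would need a different fit.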
