Estimation of English and non-English Language Use on the WWW
Gregory Grefenstette, Julien Nioche

TL;DR
This paper introduces a technique to estimate the size of language-specific content on the Web by analyzing word frequency, revealing growth trends of European languages and the continued dominance of English.
Contribution
The paper presents a novel method for estimating language-specific web content size using word frequency analysis, applied to track language growth over time.
Findings
Non-English languages are growing faster than English on the Web.
English remains the dominant language despite growth in others.
Web content in a number of European languages grew between 1996 and 1999-2000.
Abstract
The World Wide Web has grown so big, in such an anarchic fashion, that it is difficult to describe. One of the evident intrinsic characteristics of the World Wide Web is its multilinguality. Here, we present a technique for estimating the size of a language-specific corpus given the frequency of commonly occurring words in the corpus. We apply this technique to estimating the number of words available through Web browsers for given languages. Comparing data from 1996 to data from 1999 and 2000, we calculate the growth of a number of European languages on the Web. As expected, non-English languages are growing at a faster pace than English, though the position of English is still dominant.
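The estimation idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that a common word's relative frequency, measured in a reference corpus of the language, stays about the same in the target corpus, so dividing the observed count of that word by its relative frequency gives an estimate of the total number of words. The function names, the use of the median to combine per-word estimates, and all numbers are hypothetical choices made for this sketch.

```python
from statistics import median


def estimate_corpus_size(observed_counts, reference_rel_freqs):
    """Estimate the total word count of a corpus from common-word counts.

    observed_counts: word -> number of occurrences seen in the target corpus
    reference_rel_freqs: word -> relative frequency (0..1) of that word in a
        reference corpus of the same language

    Assumption (not confirmed by the source): per-word estimates are
    combined with the median, to reduce the influence of outlier words.
    """
    estimates = []
    for word, count in observed_counts.items():
        rel_freq = reference_rel_freqs.get(word)
        if not rel_freq:
            # Skip words with no usable reference frequency.
            continue
        # If the word makes up rel_freq of all text, then
        # total_words is approximately count / rel_freq.
        estimates.append(count / rel_freq)
    if not estimates:
        raise ValueError("no word has a usable reference frequency")
    return median(estimates)


def growth_factor(earlier_size, later_size):
    """Ratio of a later size estimate to an earlier one."""
    if earlier_size <= 0:
        raise ValueError("earlier_size must be positive")
    return later_size / earlier_size


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    reference = {"the": 0.05, "and": 0.025, "of": 0.03}
    counts_early = {"the": 5000, "and": 2500, "of": 3300}
    counts_late = {"the": 15000, "and": 7500, "of": 9900}

    early = estimate_corpus_size(counts_early, reference)
    late = estimate_corpus_size(counts_late, reference)
    print(f"earlier estimate: {early:.0f} words")
    print(f"later estimate:   {late:.0f} words")
    print(f"growth factor:    {growth_factor(early, late):.2f}")
```

Running the same estimate on counts gathered at two different dates, as in the usage lines above, gives a growth factor per language, which is the kind of comparison the abstract describes between the 1996 data and the 1999 and 2000 data.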
Taxonomy
Topics: Web visibility and informetrics · Web Data Mining and Analysis · Complex Network Analysis Techniques
