Index wiki database: design and experiments
A. A. Krizhanovsky

TL;DR
This paper presents a software architecture for indexing wiki texts in multiple languages, compares two Wikipedia index databases, and analyzes linguistic and growth patterns, providing open-source tools for efficient wiki data retrieval.
Contribution
It introduces a multilingual wiki indexing system, details its architecture, and compares Russian and Simple English Wikipedia indexes, including linguistic analysis and growth trends.
Findings
The Russian Wikipedia index is an order of magnitude larger than the Simple English Wikipedia index (by number of words and lexemes).
The growth rate of the number of pages in Simple English Wikipedia is 14% higher than in Russian Wikipedia.
Zipf's law holds for both Russian and Simple English Wikipedia texts.
Abstract
With the fantastic growth of Internet usage, searching for information in documents of a special type called "wiki pages", which are written in a simple markup language, has become an important problem. This paper describes a software architecture for indexing wiki texts in three languages (Russian, English, and German) and the interaction between its software components (GATE, Lemmatizer, and Synarcher). The inverted file index database was designed using the visual tool DBDesigner. The rules for parsing Wikipedia texts are illustrated by examples. Two index databases, one for Russian Wikipedia (RW) and one for Simple English Wikipedia (SEW), were built and compared. RW is an order of magnitude larger than SEW (in number of words and lexemes), though the growth rate of the number of pages in SEW was found to be 14% higher than in RW, and the rate of acquisition of new words in SEW lexicon was 7%…
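To make the idea of an inverted file index concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the paper stores its index in a database and uses GATE and Lemmatizer for text processing, whereas the hypothetical `lemmatize` below only lowercases a word, and the function names, the tokenizing regular expression, and the sample pages are all invented for this example.

```python
import re
from collections import Counter, defaultdict


def lemmatize(word: str) -> str:
    # Assumption: stand-in for a real lemmatizer; it only lowercases.
    return word.lower()


def build_index(pages: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map each lemma to {page title: occurrences of the lemma on that page}."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for title, text in pages.items():
        # [^\W\d_]+ matches runs of Unicode letters, so Cyrillic text also works.
        words = re.findall(r"[^\W\d_]+", text)
        counts = Counter(lemmatize(w) for w in words)
        for lemma, freq in counts.items():
            index[lemma][title] = freq
    return dict(index)


def corpus_frequencies(index: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Return (lemma, total frequency) pairs, most frequent first."""
    totals = {lemma: sum(per_page.values()) for lemma, per_page in index.items()}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented sample pages, used only to exercise the sketch.
pages = {
    "Wiki": "A wiki is a website. A wiki page uses simple markup.",
    "Index": "An inverted index maps a word to a page.",
}
index = build_index(pages)
ranked = corpus_frequencies(index)
```

Looking up `index["page"]` returns the pages that contain the lemma together with its per-page frequency. On a real corpus, the ranked list from `corpus_frequencies` is the kind of rank-frequency data one would plot on log-log axes to check Zipf's law.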
Taxonomy
Topics: Wikis in Education and Collaboration · Web Data Mining and Analysis · Natural Language Processing Techniques
