Russian Web Tables: A Public Corpus of Web Tables for Russian Language Based on Wikipedia
Platon Fedorov, Alexey Mironov, George Chernishev

TL;DR
This paper introduces the first publicly available corpus of Russian web tables derived from Wikipedia, along with a toolkit for crawling and analyzing such tables, to support research in data extraction and knowledge base construction.
Contribution
The authors created and released the first Russian web table corpus and developed an open-source toolkit for crawling and analyzing Russian Wikipedia tables.
Findings
The corpus contains X tables with Y rows on average.
Russian Wikipedia tables have diverse structures and semantic types.
The toolkit facilitates future research in Russian web data extraction.
Abstract
Corpora that contain tabular data, such as WebTables, are a vital resource for the academic community. Essentially, they are the backbone of modern research in information management. They are used for data extraction, knowledge base construction, question answering, column semantic type detection, and many other tasks. Such corpora are useful not only as a source of data, but also as a base for building test datasets. Until now, there have been no such corpora for the Russian language, which has seriously hindered research in these areas. In this paper, we present the first corpus of Web tables built specifically from Russian-language material. It was built with a toolkit we developed to crawl the Russian Wikipedia. Both the corpus and the toolkit are open-source and publicly available. Finally, we present a short study that describes Russian Wikipedia…
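The abstract does not show how the toolkit pulls tables out of Wikipedia pages, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the page HTML has already been downloaded and uses only Python's standard `html.parser` module. The names `TableExtractor` and `extract_tables` are hypothetical, and nested tables, `rowspan`, and `colspan` are deliberately ignored.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> in a page as a list of rows of cell strings.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # table being built, or None
        self._row = None    # row being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._cell).split())
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None

    def handle_data(self, data):
        # Text outside a cell (e.g. newlines between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    """Return every table found in the given HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# A small hand-written sample in the style of a Wikipedia "wikitable".
sample = """
<table class="wikitable">
<tr><th>Город</th><th>Река</th></tr>
<tr><td>Москва</td><td>Москва</td></tr>
<tr><td>Санкт-Петербург</td><td>Нева</td></tr>
</table>
"""
tables = extract_tables(sample)
```

Each extracted table is a plain list of rows, so simple corpus statistics such as the number of tables or the average row count can be computed directly from the result.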
Taxonomy
Topics: Web Data Mining and Analysis · Natural Language Processing Techniques · Topic Modeling
Methods: Test · Balanced Selection
