WikiOmnia: generative QA corpus on the whole Russian Wikipedia
Dina Pisarevskaya, Tatiana Shavrina

TL;DR
This paper introduces WikiOmnia, a comprehensive, automatically generated Russian Wikipedia QA dataset, enabling large-scale training and benchmarking for question answering models without extensive manual annotation.
Contribution
It presents a fully automated pipeline for creating large-scale Russian QA datasets from Wikipedia, significantly expanding available training data for Russian NLP tasks.
Findings
The raw data covers the whole Russian Wikipedia: 7,930,873 QA pairs with paragraphs for ruGPT-3 XL.
Over 3.4 million QA pairs are verified with strict automatic checks.
The pipeline has also been tested on other domains, such as news texts, fiction, and social media.
Abstract
The General QA field has been developing the methodology referencing the Stanford Question Answering Dataset (SQuAD) as the significant benchmark. However, compiling factual questions is accompanied by time- and labour-consuming annotation, limiting the training data's potential size. We present the WikiOmnia dataset, a new publicly available set of QA-pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and is also tested for creating SQuAD-formatted QA on other domains, like news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for…
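The abstract refers to SQuAD-formatted QA. As a rough illustration of what that layout looks like, the sketch below wraps one paragraph and its question-answer pairs in the nesting used by SQuAD v1.1 (`data` → `paragraphs` → `qas` → `answers` with `text` and `answer_start`). The function name `to_squad_record`, the toy paragraph, and the choice to drop answers that are not literal spans of the paragraph are assumptions made for this example; they are not taken from the WikiOmnia pipeline, whose released files may be organized differently.

```python
import json


def to_squad_record(title, context, qa_pairs):
    """Wrap one paragraph and its (question, answer) pairs in a SQuAD v1.1-style layout.

    Hypothetical helper for illustration only; not part of the WikiOmnia code.
    """
    qas = []
    for i, (question, answer) in enumerate(qa_pairs):
        # SQuAD stores the character offset of the answer inside the paragraph.
        start = context.find(answer)
        if start == -1:
            # Assumption for this sketch: skip answers that are not literal spans.
            continue
        qas.append({
            "id": f"{title}-{i}",
            "question": question,
            "answers": [{"text": answer, "answer_start": start}],
        })
    return {"title": title, "paragraphs": [{"context": context, "qas": qas}]}


# Toy input, invented for the example.
context = "WikiOmnia is a QA dataset built from Russian Wikipedia."
example = to_squad_record(
    "WikiOmnia",
    context,
    [
        ("What is WikiOmnia built from?", "Russian Wikipedia"),
        ("Who funded it?", "an unknown sponsor"),  # not a span, so it is skipped
    ],
)

dataset = {"version": "sketch", "data": [example]}
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Storing `answer_start` as a character offset lets extractive QA models be trained to predict answer spans directly, which is why the second toy pair, whose answer does not occur in the paragraph, cannot be represented in this layout.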
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Wikis in Education and Collaboration
