Wikipedia-based Datasets in Russian Information Retrieval Benchmark RusBEIR
Grigory Kovalev, Natalia Loukachevitch, Mikhail Tikhomirov, Olga Babina, Pavel Mamaev

TL;DR
This paper introduces new Russian Wikipedia-based datasets for several information retrieval tasks, compares lexical and neural retrieval models on them, and analyzes how document length affects retrieval performance.
Contribution
The paper presents a novel series of Russian IR datasets from Wikipedia, enabling expanded evaluation and comparison of lexical and neural retrieval models.
Findings
Lexical models such as BM25 tend to outperform neural models on full-document retrieval.
Neural models better capture semantics in short texts.
Combining retrieval with neural reranking improves results.
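The two-stage setup in the last finding can be illustrated with a minimal sketch: a first-stage BM25 retriever picks candidates, and a second-stage scorer reorders them. This is not the authors' implementation. The whitespace tokenizer, the `k1` and `b` defaults, the class and function names, and the `rerank_score` callable (a stand-in for a neural reranker) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization. A real Russian pipeline would likely
    # add lemmatization or stemming (assumption, not taken from the paper).
    return text.lower().split()


class BM25:
    """Small in-memory BM25 scorer (Okapi-style, with a non-negative IDF)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            # Length normalization: longer documents are penalized via b.
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        order = sorted(
            range(len(self.docs)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return order[:k]


def retrieve_then_rerank(bm25, raw_docs, query, rerank_score, k=10):
    # Stage 1: lexical candidate retrieval.
    candidates = bm25.top_k(query, k)
    # Stage 2: reorder candidates with a (hypothetical) reranker score.
    return sorted(
        candidates,
        key=lambda i: rerank_score(query, raw_docs[i]),
        reverse=True,
    )
```

In practice `rerank_score` would wrap a neural model that scores each query-document pair. Any callable with the signature `(query, document) -> float` fits this sketch, and `k` controls how many BM25 candidates the slower second stage has to score.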
Abstract
In this paper, we present a novel series of Russian information retrieval datasets constructed from the "Did you know..." section of Russian Wikipedia. Our datasets support a range of retrieval tasks, including fact-checking, retrieval-augmented generation, and full-document retrieval, by leveraging interesting facts and their referenced Wikipedia articles annotated at the sentence level with graded relevance. We describe the methodology for dataset creation that enables the expansion of existing Russian Information Retrieval (IR) resources. Through extensive experiments, we extend the RusBEIR research by comparing lexical retrieval models, such as BM25, with state-of-the-art neural architectures fine-tuned for Russian, as well as multilingual models. Results of our experiments show that lexical methods tend to outperform neural models on full-document retrieval, while neural approaches…
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Text Readability and Simplification
