Wikipedia-based Datasets in Russian Information Retrieval Benchmark RusBEIR

Grigory Kovalev; Natalia Loukachevitch; Mikhail Tikhomirov; Olga Babina; Pavel Mamaev

arXiv:2511.05079·cs.IR·November 10, 2025

Open Access

TL;DR

This paper introduces new Russian information retrieval datasets built from the "Did you know..." section of Russian Wikipedia, compares lexical and neural retrieval models on them, and analyzes how document length affects retrieval performance.

Contribution

The paper presents a novel series of Russian IR datasets from Wikipedia, enabling expanded evaluation and comparison of lexical and neural retrieval models.

Findings

01. Lexical models such as BM25 tend to outperform neural models on full-document retrieval.

02. Neural models better capture semantics in short texts.

03. Combining retrieval with neural reranking improves results.

Abstract

In this paper, we present a novel series of Russian information retrieval datasets constructed from the "Did you know..." section of Russian Wikipedia. Our datasets support a range of retrieval tasks, including fact-checking, retrieval-augmented generation, and full-document retrieval, by leveraging interesting facts and their referenced Wikipedia articles annotated at the sentence level with graded relevance. We describe the methodology for dataset creation that enables the expansion of existing Russian Information Retrieval (IR) resources. Through extensive experiments, we extend the RusBEIR research by comparing lexical retrieval models, such as BM25, with state-of-the-art neural architectures fine-tuned for Russian, as well as multilingual models. Results of our experiments show that lexical methods tend to outperform neural models on full-document retrieval, while neural approaches…
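The abstract names BM25 as the lexical baseline and describes graded relevance labels. As a rough illustration of how those two pieces fit together, here is a minimal sketch that scores a toy corpus with Okapi BM25 and evaluates the ranking with nDCG@k. It is not the authors' pipeline: the English toy documents, the relevance grades, the whitespace tokenization, and the `k1`/`b` values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are common default values, not settings taken from the paper.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def ndcg_at_k(ranked_grades, k):
    """nDCG@k for graded labels listed in ranked order (0 = irrelevant).

    Assumes ranked_grades covers every judged document for the query,
    so the ideal ranking can be derived by sorting the same list.
    """
    def dcg(grades):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

    ideal = dcg(sorted(ranked_grades, reverse=True))
    return dcg(ranked_grades) / ideal if ideal > 0 else 0.0


# Hypothetical toy data: a fact-style query, three documents, graded labels.
docs = [
    "the lighthouse is the oldest on the baltic coast".split(),
    "the city has a famous opera house".split(),
    "a lighthouse guides ships at night".split(),
]
grades = [2, 0, 1]  # hypothetical grades: 2 = highly relevant, 1 = partial, 0 = none
query = "oldest lighthouse baltic".split()

scores = bm25_scores(query, docs)
order = sorted(range(len(docs)), key=lambda i: -scores[i])
ndcg = ndcg_at_k([grades[i] for i in order], k=3)
```

The `b` parameter controls how strongly long documents are penalized, which is one reason document length can matter when comparing lexical scoring on full articles with scoring on short passages.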

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Text Readability and Simplification