WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset
Jibril Frej, Didier Schwab, Jean-Pierre Chevallet

TL;DR
WIKIR is an open-source Python toolkit that automatically constructs large-scale Wikipedia-based English information retrieval datasets, addressing the scarcity of extensive annotated data for training deep learning IR models.
Contribution
The paper introduces WIKIR, a novel toolkit for creating large-scale IR datasets from Wikipedia, along with two publicly available datasets, wikIR78k and wikIRS78k.
Findings
Provides two large-scale IR datasets with over 78,000 queries.
Enables training deep learning IR models with more data.
Improves reproducibility and research progress in IR.
Abstract
Over the past years, deep learning methods have allowed for new state-of-the-art results in ad-hoc information retrieval. However, such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines that is not publicly available for academic research, which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR78k…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
