WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset

Jibril Frej; Didier Schwab; Jean-Pierre Chevallet

arXiv:1912.01901 · cs.IR · March 18, 2020 · 1 citation

TL;DR

WIKIR is an open-source Python toolkit that automatically constructs large-scale Wikipedia-based English information retrieval datasets, addressing the scarcity of extensive annotated data for training deep learning IR models.

Contribution

The paper introduces WIKIR, a novel toolkit for creating large-scale IR datasets from Wikipedia, along with two publicly available datasets, wikIR78k and wikIRS78k.

Findings

01 Provides two large-scale IR datasets with over 78,000 queries.
02 Enables training deep learning IR models with more data.
03 Improves reproducibility and research progress in IR.

Abstract

Over the past years, deep learning methods have enabled new state-of-the-art results in ad-hoc information retrieval. However, such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines that is not publicly available for academic research, which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR78k…

Code & Models

Repositories

getalp/wikIR (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies