TL;DR
The paper introduces ir_datasets, a comprehensive, easy-to-use tool that simplifies acquiring, managing, and documenting IR datasets, addressing common challenges in data handling for IR experiments.
Contribution
The paper presents a lightweight, robust tool that integrates a large number of IR datasets behind one interface and documents them in one place, which improves reproducibility and ease of use in IR research.
Findings
Provides access to numerous IR datasets through a Python interface and a command-line interface
Demonstrates integration with IR indexing and experimentation tools
Offers a comprehensive dataset catalog for IR research
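As a rough illustration of the Python interface mentioned in the findings, the sketch below loads one dataset and previews a few of its records. It assumes the third-party `ir_datasets` package is installed (`pip install ir_datasets`) and uses the small `vaswani` collection as an example dataset ID. The helper name `preview_dataset` and its parameters are hypothetical and not part of the library; the first call may download data.

```python
from itertools import islice


def preview_dataset(dataset_id="vaswani", limit=3):
    """Return the first few docs, queries, and qrels of a dataset.

    Hypothetical helper for illustration; not part of ir_datasets.
    """
    # Imported lazily so this module can be loaded without the
    # third-party package; calling the function still requires it.
    import ir_datasets

    dataset = ir_datasets.load(dataset_id)
    # Each *_iter() method yields named tuples (for example, docs expose
    # doc_id and qrels expose query_id, doc_id, and relevance).
    return {
        "docs": list(islice(dataset.docs_iter(), limit)),
        "queries": list(islice(dataset.queries_iter(), limit)),
        "qrels": list(islice(dataset.qrels_iter(), limit)),
    }
```

Calling `preview_dataset()` returns a dictionary with up to three records of each type. On the command line, a comparable preview should be available with `ir_datasets export vaswani docs`, assuming the CLI entry point installed with the package.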
Abstract
Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet, and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on textual datasets used for ad-hoc search. This tool provides both a Python and command line interface to numerous IR datasets and benchmarks. To our knowledge, this is the most extensive tool of its kind. Integrations with popular IR indexing and experimentation toolkits demonstrate the tool's utility. We also provide documentation of these datasets through…
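To make the point about differing data formats concrete, here is a minimal sketch of the kind of normalization such a tool has to perform: two common on-disk layouts (tab-separated queries and TREC-style qrels lines of the form `query_id iteration doc_id relevance`) are parsed into uniform named tuples. This is an illustration under stated assumptions, not the implementation used by ir_datasets; the type names, function names, and sample records are all made up for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class Query(NamedTuple):
    query_id: str
    text: str


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


def parse_tsv_queries(lines: Iterable[str]) -> Iterator[Query]:
    """Parse 'query_id<TAB>text' lines, skipping blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only on the first tab so tabs inside the text survive.
        query_id, text = line.split("\t", 1)
        yield Query(query_id, text)


def parse_trec_qrels(lines: Iterable[str]) -> Iterator[Qrel]:
    """Parse whitespace-separated TREC-style qrels lines."""
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        # The second column (iteration) is ignored in this sketch.
        query_id, _iteration, doc_id, relevance = parts
        yield Qrel(query_id, doc_id, int(relevance))


# Made-up sample records, for illustration only.
queries = list(parse_tsv_queries(["1\tmeasurement of dielectric constant\n"]))
qrels = list(parse_trec_qrels(["1 0 D42 1\n", "\n", "1 0 D7 0\n"]))
```

Once both sources yield the same record types, downstream code (indexing, evaluation) no longer needs to know which file layout a dataset shipped in.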
