Simplified Data Wrangling with ir_datasets

Sean MacAvaney; Andrew Yates; Sergey Feldman; Doug Downey; Arman Cohan; Nazli Goharian

arXiv:2103.02280·cs.IR·May 11, 2021

TL;DR

The paper introduces ir_datasets, a comprehensive, easy-to-use tool that simplifies acquiring, managing, and documenting IR datasets, addressing common challenges in data handling for IR experiments.

Contribution

It presents a new lightweight, robust tool with extensive dataset integration and documentation, enhancing reproducibility and ease of use in IR research.

Findings

01

Provides access to numerous IR datasets via Python and CLI

02

Demonstrates integration with IR indexing and experimentation tools

03

Offers a comprehensive dataset catalog for IR research
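
As a rough illustration of the Python interface mentioned in the findings above, the sketch below counts relevant documents per query from a stream of relevance judgments. It assumes the ir_datasets package is installed, that `ir_datasets.load(...)` returns a dataset object exposing `qrels_iter()`, and that each judgment carries `query_id`, `doc_id`, and `relevance` fields. The `Qrel` tuple and the helper names `relevant_counts` and `load_qrels` are hypothetical, introduced only for this example.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Qrel(NamedTuple):
    """Hypothetical stand-in with the fields assumed for a relevance judgment."""
    query_id: str
    doc_id: str
    relevance: int


def relevant_counts(qrels: Iterable, min_relevance: int = 1) -> Counter:
    """Count, per query, the judgments whose relevance is at least min_relevance."""
    counts = Counter()
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            counts[qrel.query_id] += 1
    return counts


def load_qrels(dataset_id: str):
    """Return an iterator over a dataset's judgments (may trigger a download)."""
    # Imported lazily so the counting helper works without the package installed.
    import ir_datasets  # third-party: pip install ir_datasets

    return ir_datasets.load(dataset_id).qrels_iter()
```

With the package installed, something like `relevant_counts(load_qrels("vaswani"))` would be expected to stream the judgments for that dataset identifier and return one count per judged query. The counting helper itself works on any iterable with the three assumed fields.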

Abstract

Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet, and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on textual datasets used for ad-hoc search. This tool provides both a Python and command line interface to numerous IR datasets and benchmarks. To our knowledge, this is the most extensive tool of its kind. Integrations with popular IR indexing and experimentation toolkits demonstrate the tool's utility. We also provide documentation of these datasets through…

Code & Models

Repositories

allenai/ir_datasets (Official)