Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR

Nandan Thakur; Luiz Bonifacio; Maik Fröbe; Alexander Bondarenko; Ehsan Kamalloo; Martin Potthast; Matthias Hagen; Jimmy Lin

arXiv:2407.07790·cs.IR·July 11, 2024

TL;DR

This study systematically evaluates neural retrieval models on the Touché 2020 argument retrieval dataset, revealing biases towards short passages and data quality issues, and demonstrates that data denoising improves neural model performance but BM25 remains superior.

Contribution

It provides a detailed reproducibility analysis of neural retrieval models on Touché 2020, highlighting biases and data issues, and proposes data denoising to improve neural retrieval effectiveness.

Findings

01 Neural models favor short passages, often non-argumentative.

02 Many neural model results are based on unjudged data.

03 Denoising improves neural model performance by up to 0.52 in nDCG@10.
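
The two quantities behind findings 02 and 03 can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes linear gain with a log2 rank discount, treats unjudged documents as non-relevant, and the function names and the toy `qrels` are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k for one query.

    ranking: list of document ids, best first.
    qrels:   dict mapping document id -> graded relevance label.
    Documents missing from qrels (unjudged) count as non-relevant (assumption).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranking]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def judged_at_k(ranking, qrels, k=10):
    """Fraction of the top-k results that have any relevance judgment."""
    top = ranking[:k]
    return sum(1 for doc_id in top if doc_id in qrels) / len(top) if top else 0.0


# Toy example: "x" was never judged, so it lowers both measures.
qrels = {"a": 2, "b": 1, "c": 0}
print(ndcg_at_k(["a", "b", "c"], qrels))
print(ndcg_at_k(["x", "a"], qrels))
print(judged_at_k(["x", "a"], qrels, k=2))
```

A low judged@k means that much of a model's top ranking was never assessed, so its nDCG@10 may understate its effectiveness until further judgments are added.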

Abstract

The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark, a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what makes argument retrieval so "special". To more deeply analyze the respective potential limits of neural retrieval models, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards…
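
Given the reported bias towards short passages, one plausible form of corpus denoising is a simple length filter applied before indexing. The sketch below is an assumption for illustration only: this page does not specify the authors' filtering rule, and the function name, the whitespace tokenization, and the `min_words` value are all hypothetical.

```python
def drop_short_passages(corpus, min_words):
    """Keep only passages with at least `min_words` whitespace-separated tokens.

    corpus:    dict mapping document id -> passage text.
    min_words: hypothetical cutoff chosen by the caller; not taken from the paper.
    """
    return {
        doc_id: text
        for doc_id, text in corpus.items()
        if len(text.split()) >= min_words
    }


# Toy corpus: one very short, non-argumentative passage and one longer argument.
corpus = {
    "d1": "Yes.",
    "d2": "School uniforms reduce peer pressure because students stop comparing clothes.",
}
filtered = drop_short_passages(corpus, min_words=5)
print(sorted(filtered))
```

Any such filter changes the document pool, so scores before and after filtering are only comparable when the relevance judgments are restricted to the same filtered corpus.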

Code & Models

Repositories

castorini/touche-error-analysis (Official)
