TL;DR
This study systematically evaluates neural retrieval models on the Touché 2020 argument retrieval dataset. It finds that the models are biased towards short passages and that the data has quality issues, and it shows that denoising the data improves neural model effectiveness, although BM25 remains more effective.
Contribution
It provides a detailed reproducibility analysis of neural retrieval models on Touché 2020, highlighting biases and data issues, and proposes data denoising to improve neural retrieval effectiveness.
Findings
Neural models favor short passages, many of which are non-argumentative.
Many of the passages retrieved by neural models have no relevance judgments.
Denoising improves neural model performance by up to 0.52 in nDCG@10.
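For readers unfamiliar with the metric behind these findings, the following is a minimal sketch of nDCG@10, plus the share of unjudged passages in a top-10 ranking. It is not the evaluation code used in the study. The function names, the toy `qrels` dictionary, and the choice of linear gains with unjudged passages counted as non-relevant are assumptions made for illustration.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_ids: passage ids in ranked order (hypothetical ids).
    qrels: dict mapping passage id -> graded relevance label.
    Assumption: unjudged passages are treated as non-relevant (gain 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def unjudged_at_k(ranked_ids, qrels, k=10):
    """Fraction of the top-k passages that have no relevance judgment."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id not in qrels) / len(top) if top else 0.0


# Toy example: passage "x" was never judged, so it contributes no gain.
qrels = {"a": 2, "b": 1, "c": 0}
ranking = ["x", "a", "b"]
score = ndcg_at_k(ranking, qrels)
unjudged = unjudged_at_k(ranking, qrels)
```

Under this convention, a ranking with many unjudged passages scores low even if some of those passages are in fact relevant, which is why the fraction of unjudged results matters when comparing systems.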
Abstract
The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark -- a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what makes argument retrieval so "special". To more deeply analyze the respective potential limits of neural retrieval models, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards…
