TL;DR
This paper applies dueling bandits algorithms to the evaluation of neural rankers through human preference judgments, proposing an offline evaluation framework that accounts for ties and minimizes the number of judgments required.
Contribution
It introduces a novel application of dueling bandits for offline human preference-based ranking evaluation and proposes modifications to improve algorithm performance.
Findings
Simulations identify one algorithm as promising for collecting human preference judgments.
The modified algorithm performs well in collecting preference data.
Over 10,000 judgments collected for TREC submissions validate the approach.
Abstract
The dramatic improvements in core information retrieval tasks engendered by neural rankers create a need for novel evaluation methods. If every ranker returns highly relevant items in the top ranks, it becomes difficult to recognize meaningful differences between them and to build reusable test collections. Several recent papers explore pairwise preference judgments as an alternative to traditional graded relevance assessments. Rather than viewing items one at a time, assessors view items side-by-side and indicate the one that provides the better response to a query, allowing fine-grained distinctions. If we employ preference judgments to identify the probably best items for each query, we can measure rankers by their ability to place these items as high as possible. We frame the problem of finding best items as a dueling bandits problem. While many papers explore dueling bandits for…
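The evaluation idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm. It compares every pair of items exhaustively, whereas a dueling bandits algorithm would choose pairs adaptively to reduce the number of judgments. The function names, the `quality` scores, and the simulated assessor are all hypothetical and exist only for illustration. Ties are modeled by letting the assessor return `None`.

```python
from itertools import combinations


def best_items_by_wins(items, prefer):
    """Return the items with the most pairwise wins.

    `prefer(a, b)` returns the preferred item, or None for a tie.
    Exhaustive comparison is used here for clarity only; it is an
    assumption of this sketch, not the method described in the paper.
    """
    wins = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        outcome = prefer(a, b)
        if outcome is not None:  # a tie credits neither item
            wins[outcome] += 1
    top = max(wins.values())
    return [item for item in items if wins[item] == top]


def reciprocal_rank_of_best(ranking, best_items):
    """Score a ranker by how high it places any of the best items."""
    for position, item in enumerate(ranking, start=1):
        if item in best_items:
            return 1.0 / position
    return 0.0


# Hypothetical latent quality scores standing in for human assessors.
quality = {"d1": 3, "d2": 3, "d3": 1, "d4": 2}


def simulated_assessor(a, b):
    """Prefer the higher-quality item; report a tie when scores match."""
    if quality[a] == quality[b]:
        return None
    return a if quality[a] > quality[b] else b


best = best_items_by_wins(list(quality), simulated_assessor)
score_a = reciprocal_rank_of_best(["d1", "d4", "d2", "d3"], best)
score_b = reciprocal_rank_of_best(["d4", "d2", "d1", "d3"], best)
```

In this toy setup `d1` and `d2` tie for best, so a ranker that places either of them first scores higher than one that places `d4` first.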
