Predicting Relevance based on Assessor Disagreement: Analysis and Practical Applications for Search Evaluation
Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, Chris Develder

TL;DR
This paper introduces the Predicted Relevance Model (PRM), a method to predict user relevance by accounting for assessor disagreement, improving the robustness and interpretability of search engine evaluation metrics.
Contribution
The paper proposes the PRM, a novel approach that models assessor disagreement to better estimate user relevance and enhance evaluation metrics for search systems.
Findings
PRM improves relevance prediction accuracy.
PRM supplies data-driven gain values that enhance existing evaluation metrics.
PRM demonstrates effectiveness on multiple test collections.
Abstract
Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor that provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users is the possible disagreement on relevance, assuming that a single gold truth label does not exist. This paper presents and analyzes the Predicted Relevance Model (PRM), which allows predicting a particular result's relevance for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor…
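The core idea in the abstract, predicting relevance for a random user from one observed assessment plus average assessor disagreement, can be illustrated with a small sketch. This is not the authors' exact formulation. It assumes a set of results that were each judged independently by two assessors, and it estimates, for each observed label, the fraction of cases in which the other assessor considered the result relevant. The function names, the label strings, and the toy data are all hypothetical.

```python
from collections import defaultdict


def estimate_predicted_relevance(double_judgments, relevant_labels):
    """Estimate P(another judge says relevant | observed label).

    double_judgments: iterable of (label_a, label_b) pairs, i.e. two
        independent assessors' labels for the same query-result pair
        (an assumed input format, not taken from the paper).
    relevant_labels: set of labels that count as "relevant".
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [relevant, total]
    for label_a, label_b in double_judgments:
        # Use each pair in both directions: neither assessor is privileged.
        for observed, other in ((label_a, label_b), (label_b, label_a)):
            counts[observed][1] += 1
            if other in relevant_labels:
                counts[observed][0] += 1
    return {label: rel / total for label, (rel, total) in counts.items()}


def expected_precision_at_k(ranked_labels, gains, k):
    """Precision@k with predicted-relevance probabilities used as gains."""
    return sum(gains[label] for label in ranked_labels[:k]) / k


# Toy data (hypothetical): four doubly judged results.
pairs = [("rel", "rel"), ("rel", "non"), ("non", "non"), ("non", "non")]
gains = estimate_predicted_relevance(pairs, relevant_labels={"rel"})
score = expected_precision_at_k(["rel", "non", "rel"], gains, k=3)
```

In this sketch a result labeled relevant by one assessor contributes less than a full unit of gain, and a result labeled non-relevant contributes more than zero, which is how disagreement enters the metric. A label that never occurs in the doubly judged data has no estimate and would raise a `KeyError` in `expected_precision_at_k`.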
