Relevance Assessments for Web Search Evaluation: Should We Randomise or Prioritise the Pooled Documents? (CORRECTED VERSION)
Tetsuya Sakai, Sijie Tao, and Zhaohao Zeng

TL;DR
This study compares two strategies for ordering pooled documents for relevance assessors in web search evaluation, prioritisation (PRI) and randomisation (RND), and analyses their impact on inter-assessor agreement, robustness, and assessment efficiency using the newly released WWW3E8 data set.
Contribution
It introduces the WWW3E8 dataset and provides a comprehensive comparison of PRI and RND strategies in web search relevance assessments.
Findings
RND yields comparable inter-assessor agreement to PRI.
RND provides more robust system rankings for new systems.
Assessment efficiency differs between strategies, with implications for evaluation protocols.
Abstract
In the context of depth-k pooling for constructing web search test collections, we compare two approaches to ordering pooled documents for relevance assessors: the prioritisation strategy (PRI) used widely at NTCIR, and the simple randomisation strategy (RND). In order to address research questions regarding PRI and RND, we have constructed and released the WWW3E8 data set, which contains eight independent relevance labels for 32,375 topic-document pairs, i.e., a total of 259,000 labels. Four of the eight relevance labels were obtained from PRI-based pools; the other four were obtained from RND-based pools. Using WWW3E8, we compare PRI and RND in terms of inter-assessor agreement, system ranking agreement, and robustness to new systems that did not contribute to the pools. We also utilise an assessor activity log we obtained as a byproduct of WWW3E8 to compare the two strategies in…
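To make the two orderings concrete, here is a minimal Python sketch of pooling for a single topic, followed by an RND-style and a PRI-style ordering of the pool. The abstract does not spell out how PRI ranks documents. The heuristic used here (more contributing runs first, then smaller sum of ranks) is an assumption made for illustration, and the function names, run names, and document IDs are all hypothetical.

```python
import random
from collections import defaultdict


def depth_k_pool(runs, k):
    """Return the union of the top-k documents of every run (one topic)."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


def order_rnd(pool, seed=0):
    """RND-style ordering: shuffle the pooled documents."""
    docs = sorted(pool)  # sort first so a given seed is reproducible
    random.Random(seed).shuffle(docs)
    return docs


def order_pri(runs, k):
    """PRI-style ordering (ASSUMED heuristic, not taken from the abstract):
    documents returned by more runs come first; ties are broken by the
    smaller sum of ranks, then by document ID."""
    votes = defaultdict(int)
    rank_sum = defaultdict(int)
    for ranking in runs.values():
        for rank, doc in enumerate(ranking[:k], start=1):
            votes[doc] += 1
            rank_sum[doc] += rank
    return sorted(votes, key=lambda d: (-votes[d], rank_sum[d], d))


# Toy runs for a single topic (hypothetical data).
runs = {
    "runA": ["d1", "d2", "d3"],
    "runB": ["d2", "d1", "d4"],
    "runC": ["d2", "d5", "d1"],
}

pool = depth_k_pool(runs, k=2)
pri = order_pri(runs, k=2)
rnd = order_rnd(pool, seed=42)

print(sorted(pool))
print(pri)
print(rnd)
```

Both orderings cover exactly the same pool. Only the sequence in which an assessor sees the documents changes, which is the variable the paper studies.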
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems
Methods: Test
