Better Than Their Reputation? On the Reliability of Relevance Assessments with Students

Philipp Schaer

arXiv:1206.4802·cs.IR·May 3, 2017

TL;DR

This study examines the reliability of relevance assessments made by students in information retrieval evaluations, highlighting the impact of assessor disagreement on evaluation accuracy and recommending filtering unreliable assessments.

Contribution

It provides an analysis of inter-assessor reliability using statistical measures and shows how filtering unreliable assessments improves evaluation consistency.

Findings

01

Average Fleiss' Kappa was 0.37

02

Average Krippendorff's Alpha was 0.15

03

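Filtering out unreliable assessments changes results by a root mean square error of 0.02 to 0.12, indicating that disagreement affects evaluation reliability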
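
For readers unfamiliar with the first measure, the following is a minimal sketch of the standard Fleiss' Kappa formula, not the paper's own code. It assumes every document was judged by the same number of assessors; the function name `fleiss_kappa` and the toy ratings are hypothetical and chosen only for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table where counts[i][j] is the number of
    assessors who assigned item i to category j.

    Assumption: every item was judged by the same number of assessors.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two assessors per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per item: fraction of assessor pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Toy data (made up): 4 documents, 3 assessors, categories [relevant, not relevant].
toy = [[2, 1], [1, 2], [3, 0], [0, 3]]
print(round(fleiss_kappa(toy), 3))  # 0.333
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, which puts the reported average of 0.37 in context.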

Abstract

During the last three years we conducted several information retrieval evaluation series with more than 180 LIS students, who made relevance assessments on the outcomes of three specific retrieval services. In this study we focus not on the retrieval performance of our system but on the relevance assessments and the inter-assessor reliability. To quantify the agreement we apply Fleiss' Kappa and Krippendorff's Alpha. On average, Kappa values were 0.37 and Alpha values 0.15. We use the two agreement measures to drop overly unreliable assessments from our data set. When computing the differences between the unfiltered and the filtered data set, we see a root mean square error between 0.02 and 0.12. We see this as a clear indicator that disagreement affects the reliability of retrieval evaluations. We suggest not to work with unfiltered results…
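
The comparison between the unfiltered and the filtered data set can be illustrated with a short sketch of the root mean square error. This is only an assumption about the general shape of the computation: the abstract does not state which per-topic or per-system scores were compared, so the function name `rmse` and the score values below are hypothetical.

```python
import math


def rmse(unfiltered, filtered):
    """Root mean square error between two equally long score lists."""
    if not unfiltered or len(unfiltered) != len(filtered):
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(u - f) ** 2 for u, f in zip(unfiltered, filtered)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up evaluation scores, one per topic, before and after filtering.
scores_unfiltered = [0.40, 0.55, 0.30]
scores_filtered = [0.45, 0.50, 0.35]
print(round(rmse(scores_unfiltered, scores_filtered), 3))  # 0.05
```

An RMSE of 0 would mean filtering left the scores unchanged; larger values mean the unreliable assessments had more influence on the measured results.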
