k-Rater Reliability: The Correct Unit of Reliability for Aggregated Human Annotations

Ka Wong; Praveen Paritosh

arXiv:2203.12913·cs.AI·March 25, 2022

TL;DR

This paper introduces k-rater reliability (kRR) as the appropriate measure of reliability for aggregated human annotations in NLP, and argues that reporting only the reliability of individual ratings under-reports the reliability of aggregated data.

Contribution

It proposes kRR as the correct reliability measure for aggregated data, presents empirical, analytical, and bootstrap-based methods for computing it, and encourages NLP researchers to report it in addition to IRR.

Findings

01 kRR measures the reliability of aggregated annotations, whereas IRR measures only the reliability of individual ratings

02 Empirical, analytical, and bootstrap-based methods for computing kRR on WordSim-353 produce very similar results

03 Reporting kRR alongside IRR avoids under-reporting the reliability of aggregated NLP datasets

Abstract

Since the inception of crowdsourcing, aggregation has been a common strategy for dealing with unreliable data. Aggregate ratings are more reliable than individual ones. However, many natural language processing (NLP) applications that rely on aggregate ratings only report the reliability of individual ratings, which is the incorrect unit of analysis. In these instances, the data reliability is under-reported, and a proposed k-rater reliability (kRR) should be used as the correct data reliability for aggregated datasets. It is a multi-rater generalization of inter-rater reliability (IRR). We conducted two replications of the WordSim-353 benchmark, and present empirical, analytical, and bootstrap-based methods for computing kRR on WordSim-353. These methods produce very similar results. We hope this discussion will nudge researchers to report kRR in addition to IRR.
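
To make the idea of "reliability of a k-rater aggregate" concrete, here is a minimal Python sketch of two ways such a number could be estimated. It is an illustration under stated assumptions, not the authors' procedure. The abstract does not name its analytical method; the sketch assumes the classical Spearman-Brown prediction formula, which extrapolates single-rater reliability to the mean of k raters. The resampling routine is also an assumption: it repeatedly draws two disjoint groups of k ratings per item, averages each group, and correlates the two sets of averages. The names `spearman_brown`, `pearson`, and `resampled_krr` are hypothetical.

```python
import random
from statistics import mean


def spearman_brown(irr: float, k: int) -> float:
    """Predicted reliability of the mean of k raters, given single-rater
    reliability `irr` (classical Spearman-Brown formula; assumed here,
    not confirmed as the paper's analytical method)."""
    return k * irr / (1 + (k - 1) * irr)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def resampled_krr(ratings, k, n_resamples=200, seed=0):
    """Resampling estimate of k-rater reliability (illustrative only).

    `ratings` is a list of per-item rating lists, each holding at least
    2 * k ratings. In every resample, each item contributes two disjoint
    groups of k ratings; the group means are correlated across items,
    and the correlations are averaged over resamples.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        group_a, group_b = [], []
        for item in ratings:
            draw = rng.sample(item, 2 * k)  # without replacement
            group_a.append(mean(draw[:k]))
            group_b.append(mean(draw[k:]))
        estimates.append(pearson(group_a, group_b))
    return mean(estimates)
```

With `k=1` the resampling routine reduces to a correlation between two single ratings per item, which is one way to think of IRR; larger `k` shows how aggregation raises reliability. Drawing without replacement keeps the two groups disjoint, so a shared rating cannot inflate the correlation.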

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems