Goldilocks: Consistent Crowdsourced Scalar Annotations with Relative Uncertainty
Quanze Chen, Daniel S. Weld, Amy X. Zhang

TL;DR
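Goldilocks is a crowd rating method that improves consistency and captures uncertainty by grounding rating scales with examples and using a two-step bounding process to place each item within a range. It is tested in three domains: toxicity of online comments, satiety of food in images, and age estimation from portraits.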
Contribution
Goldilocks introduces a novel elicitation technique that distinguishes ambiguity inherent to an item from disagreement between annotators, improving the quality of scalar annotations.
Findings
Improves consistency in subjective rating domains.
Captures different sources of uncertainty with item ranges.
Enhances estimates of pairwise relationship distributions.
Abstract
Human ratings have become a crucial resource for training and evaluating machine learning systems. However, traditional elicitation methods for absolute and comparative rating suffer from issues with consistency and often do not distinguish between uncertainty due to disagreement between annotators and ambiguity inherent to the item being rated. In this work, we present Goldilocks, a novel crowd rating elicitation technique for collecting calibrated scalar annotations that also distinguishes inherent ambiguity from inter-annotator disagreement. We introduce two main ideas: grounding absolute rating scales with examples and using a two-step bounding process to establish a range for an item's placement. We test our designs in three domains: judging toxicity of online comments, estimating satiety of food depicted in images, and estimating age based on portraits. We show that (1) Goldilocks…
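To make the distinction between the two sources of uncertainty concrete, here is a minimal sketch that assumes each annotator returns a lower and an upper bound for an item on a shared numeric scale. The function name `summarize_ranges`, the dictionary keys, and the specific choices (mean range width as a proxy for inherent ambiguity, spread of range midpoints as a proxy for inter-annotator disagreement) are illustrative assumptions, not the aggregation procedure used in the paper.

```python
from statistics import mean, pstdev


def summarize_ranges(ranges):
    """Summarize per-annotator (lower, upper) bounds for one item.

    Hypothetical helper: the mapping of width -> ambiguity and
    midpoint spread -> disagreement is an assumption for illustration.
    """
    if not ranges:
        raise ValueError("at least one range is required")
    for lower, upper in ranges:
        if lower > upper:
            raise ValueError(f"lower bound {lower} exceeds upper bound {upper}")

    midpoints = [(lower + upper) / 2 for lower, upper in ranges]
    widths = [upper - lower for lower, upper in ranges]

    return {
        # Point estimate of the item's placement on the scale.
        "estimate": mean(midpoints),
        # Wide ranges: annotators individually find the item hard to place.
        "ambiguity": mean(widths),
        # Scattered midpoints: annotators disagree with one another.
        "disagreement": pstdev(midpoints),
    }


# Annotators agree, but each finds the item ambiguous.
ambiguous_item = summarize_ranges([(2, 4), (2, 4), (2, 4)])

# Annotators are each certain, but disagree with one another.
contested_item = summarize_ranges([(1, 1), (5, 5)])
```

Under these assumptions, both items get the same point estimate, but the first shows nonzero ambiguity with zero disagreement and the second shows the reverse, which is information a single scalar rating per annotator cannot separate.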
