Towards a Shared Rubric for Dataset Annotation
Andrew Marc Greene

TL;DR
This paper proposes a voluntary rubric for evaluating and comparing third-party dataset annotation providers to promote higher quality practices and facilitate better decision-making.
Contribution
It introduces a shared rubric that serves as a scorecard, communication tool, and incentive for improving annotation quality among vendors.
Findings
Enables comparison of annotation providers' quality.
Helps justify higher costs for better annotation quality.
Encourages vendors to adopt improved annotation practices.
Abstract
When arranging for third-party data annotation, it can be hard to compare how well the competing providers apply best practices to create high-quality datasets. This leads to a "race to the bottom," where competition based solely on price makes it hard for vendors to charge for high-quality annotation. We propose a voluntary rubric which can be used (a) as a scorecard to compare vendors' offerings, (b) to communicate our expectations of the vendors more clearly and consistently than today, (c) to justify the expense of choosing someone other than the lowest bidder, and (d) to encourage annotation providers to improve their practices.
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Scientific Computing and Data Management
