Towards a Shared Rubric for Dataset Annotation

Andrew Marc Greene

arXiv:2112.03867 · cs.LG · December 8, 2021

TL;DR

This paper proposes a voluntary rubric for evaluating and comparing third-party dataset annotation providers, to promote higher-quality practices and to support choosing a vendor on grounds other than price.

Contribution

It introduces a shared rubric that serves as a scorecard, communication tool, and incentive for improving annotation quality among vendors.

Findings

01. Rubric enables comparison of annotation providers' quality.

02. Helps justify higher costs for better annotation quality.

03. Encourages vendors to adopt improved annotation practices.

Abstract

When arranging for third-party data annotation, it can be hard to compare how well the competing providers apply best practices to create high-quality datasets. This leads to a "race to the bottom," where competition based solely on price makes it hard for vendors to charge for high-quality annotation. We propose a voluntary rubric which can be used (a) as a scorecard to compare vendors' offerings, (b) to communicate our expectations of the vendors more clearly and consistently than today, (c) to justify the expense of choosing someone other than the lowest bidder, and (d) to encourage annotation providers to improve their practices.

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Scientific Computing and Data Management