GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction

Virginia K. Felkner; Jennifer A. Thompson; Jonathan May

arXiv:2405.15760 · cs.CL · May 27, 2024 · 1 citation

TL;DR

This paper shows that GPT-3.5-Turbo cannot reliably replace human annotators when building a social bias benchmark from open-ended community survey responses. The case studied is antisemitism and the Jewish community.

Contribution

The paper provides empirical evidence that GPT-3.5-Turbo performs poorly at annotating social bias data, which underscores the continued need for human annotation in fairness benchmarks.

Findings

01

GPT-3.5-Turbo produces unacceptable quality in bias annotation tasks.

02

Automated assistance does not match human judgment in sensitive social bias detection.

03

Using GPT-3.5-Turbo undermines community-sourced bias benchmark efforts.

Abstract

Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human…

Taxonomy

Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)

Methods: Attention Is All You Need · Sparse Evolutionary Training · Adam · Dropout · Dense Connections · Softmax · Layer Normalization