"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations

Michael Hardy

arXiv:2411.15634·cs.CL·November 26, 2024

TL;DR

This paper explores methods to evaluate and improve the reliability of model and human annotations in tasks with inherently noisy labels, using large language models and novel assessment metrics across multiple quality dimensions.

Contribution

It introduces new evaluation approaches for label quality in noisy settings and demonstrates how LLMs can both outperform humans and reveal biases in classroom annotation tasks.

Findings

01. Encoder models achieve state-of-the-art results, even surpassing human performance.

02. Standard metrics can mask issues like biases and spurious correlations.

03. Using rigorous evaluation reveals racial biases and impacts human-model collaboration.

Abstract

"Gold" and "ground truth" human-mediated labels contain error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently human-only task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families (encoders and GPT decoders), applying novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the…

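The abstract's point that label error can escape commonly reported metrics can be illustrated with a small sketch. The example below is not the paper's method: it assumes two hypothetical raters (`rater_a`, `rater_b`) assigning made-up "high"/"low" labels, and compares raw percent agreement with Cohen's kappa, a standard chance-corrected agreement statistic. All data and function names here are illustrative assumptions.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently according to their own marginals.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(a) | set(b)
    ) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical, made-up ratings for 20 items (not data from the paper).
# Both raters say "high" 18 times, but never agree on which items are "low".
rater_a = ["high"] * 18 + ["low"] * 2
rater_b = ["high"] * 16 + ["low"] * 2 + ["high"] * 2

print(f"percent agreement: {percent_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(rater_a, rater_b):.2f}")
```

With these made-up labels the raters agree on 80% of items, yet kappa is slightly negative, because the skewed label distribution alone would produce about 82% agreement by chance. This is one simple way a headline agreement figure can hide low reliability.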

Code & Models

Repositories

hardy-education/llm-psychometrics (official)


Taxonomy

Topics: Semantic Web and Ontologies

Methods: Attention Is All You Need · Dense Connections · Dropout · Discriminative Fine-Tuning · Cosine Annealing · Linear Layer · Attention Dropout · Layer Normalization · Byte Pair Encoding