"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations
Michael Hardy

TL;DR
This paper explores methods to evaluate and improve the reliability of model and human annotations in tasks with inherently noisy labels, using large language models and novel assessment metrics across multiple quality dimensions.
Contribution
It introduces new evaluation approaches for label quality in noisy settings and demonstrates how LLMs can both outperform humans and reveal biases in classroom annotation tasks.
Findings
Encoder models achieve state-of-the-art results, even surpassing human performance.
Standard metrics can mask issues like biases and spurious correlations.
Rigorous evaluation reveals racial biases and effects on human-model collaboration.
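The finding that standard metrics can mask bias can be illustrated with a minimal sketch. It is not the paper's method: it uses Cohen's kappa, a common chance-corrected agreement statistic, on invented binary ratings, and the group labels "A" and "B" are hypothetical. The point is only that a single pooled agreement score can look acceptable while agreement for one subgroup is no better than chance.

```python
# Minimal sketch with invented data: a pooled agreement score versus
# the same score computed per subgroup. None of these numbers come
# from the paper; groups "A" and "B" are hypothetical.

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the raters were independent, from the marginals.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters use one identical label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical human and model ratings (1 = high quality, 0 = low quality).
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
model = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
group = ["A"] * 8 + ["B"] * 4

# Pooled score over all items.
overall = cohen_kappa(human, model)

# The same statistic computed separately for each subgroup.
by_group = {}
for g in sorted(set(group)):
    idx = [i for i, x in enumerate(group) if x == g]
    by_group[g] = cohen_kappa([human[i] for i in idx], [model[i] for i in idx])

print(f"overall kappa: {overall:.2f}")
for g, k in by_group.items():
    print(f"group {g} kappa: {k:.2f}")
```

In this toy data the model never assigns a high rating to group B, yet the pooled kappa of about 0.67 gives no hint of that. Disaggregating by group is one simple way to surface such a pattern; the paper's own evaluation dimensions may use different statistics.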
Abstract
"Gold" and "ground truth" human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two Large Language Model (LLM) architecture families--encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the…
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Attention Is All You Need · Dense Connections · Dropout · Discriminative Fine-Tuning · Cosine Annealing · Linear Layer · Attention Dropout · Layer Normalization · Byte Pair Encoding
