Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text

Jennifer Healey, Laurie Byrum, Md Nadeem Akhtar, Surabhi Bhargava, and Moumita Sinha

arXiv:2505.03053 · cs.CL · May 7, 2025

TL;DR

This paper presents a semi-automated framework for evaluating bias in generated free text responses, integrating human insights to improve accuracy and uncover issues in bias benchmarks.

Contribution

It introduces a novel bias evaluation framework that combines automation with human insights, enabling more valid assessments of bias in real-world LLM deployments.

Findings

01. Operational definition of bias enabled automation

02. Methodology for classifying bias beyond multiple choice

03. Uncovered problematic templates in bias benchmarks

Abstract

LLM evaluation is challenging even in the case of base models. In real-world deployments, evaluation is further complicated by the interplay of task-specific prompts and experiential context. At scale, bias evaluation is often based on short-context, fixed-choice benchmarks that can be rapidly evaluated; however, these can lose validity when the LLMs' deployed context differs. Large-scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.

Taxonomy

Topics: Software Reliability and Analysis Research

Methods: Balanced Selection