A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models

Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joelle Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor Horn; Karan Singhal

arXiv:2403.12025·cs.CY·October 8, 2024·6 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces a comprehensive framework and datasets for identifying biases in large language models used in healthcare, aiming to improve health equity by surfacing potential harms in model-generated medical answers.

Contribution

It presents a multifactorial human assessment framework and the EquityMedQA dataset, enabling more effective detection of biases in LLMs for medical applications.

Findings

01. Our approach uncovers biases missed by narrower evaluations.

02. Diverse assessment methods and raters improve bias detection.

03. The methodology highlights the importance of participatory review processes.

Abstract

Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study,…

Code & Models

Repositories

google-research/google-research
tf · Official

Datasets

katielink/EquityMedQA
dataset · 1.9k downloads

Taxonomy

Topics: Healthcare Systems and Public Health