FairMedQA: Benchmarking Bias in Large Language Models for Medical Question Answering
Ying Xiao, Jie Huang, Ruijuan He, Jing Xiao, Mohammad Reza Mousavi, Yepang Liu, Kezhi Li, Zhenpeng Chen, Jie M. Zhang

TL;DR
This paper introduces FairMedQA, a new benchmark for measuring biases in large language models used for medical question answering, revealing significant disparities and exposing limitations of existing benchmarks.
Contribution
The paper presents FairMedQA, a benchmark of 4,806 counterfactual question pairs built from 801 clinical vignettes, uses it to evaluate bias in 12 LLMs, and reports that it is more sensitive to bias than previous benchmarks.
Findings
Substantial accuracy disparities across sensitive demographic groups (3 to 19 percentage points)
FairMedQA detects biases at least 12% larger than previous benchmarks
Highlights urgent need for debiasing and validation in clinical LLM applications
Abstract
Large language models (LLMs) are approaching expert-level performance in medical question answering (QA), demonstrating strong potential to improve public healthcare. However, underlying biases related to sensitive attributes such as sex and race pose life-critical risks. The extent to which such sensitive attributes affect diagnosis remains an open question and requires comprehensive empirical investigation. Additionally, even the latest Counterfactual Patient Variations (CPV) benchmark can hardly distinguish the bias levels of different LLMs. To further explore these dynamics, we propose a new benchmark, FairMedQA, and benchmark 12 representative LLMs. FairMedQA contains 4,806 counterfactual question pairs constructed from 801 clinical vignettes. Our results reveal substantial accuracy disparity ranging from 3 to 19 percentage points across sensitive demographic groups. Notably,…
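The accuracy disparity reported above can be illustrated with a small sketch. It is not the authors' evaluation code: it assumes that disparity is measured as the gap, in percentage points, between the best and worst per-group accuracy, and the function names, group labels, and toy records are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return per-group accuracy from (group, is_correct) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def accuracy_disparity_pp(records):
    """Gap between best and worst group accuracy, in percentage points.

    Assumption: max-minus-min is used here for illustration only;
    the benchmark may define disparity differently.
    """
    acc = accuracy_by_group(records)
    return (max(acc.values()) - min(acc.values())) * 100


# Hypothetical toy data: the same questions answered under two
# counterfactual demographic variants.
toy_records = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", True), ("group_b", False), ("group_b", False),
]

if __name__ == "__main__":
    print(accuracy_by_group(toy_records))
    print(accuracy_disparity_pp(toy_records))
```

On the toy data, group_a is answered correctly 3 times out of 4 and group_b 2 times out of 4, so the sketch reports a 25 percentage point gap. As a side note on the dataset size, 4,806 pairs from 801 vignettes works out to six counterfactual pairs per vignette.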
Taxonomy
Topics: Quality and Management Systems · Quality and Safety in Healthcare
Methods: Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Attention Is All You Need · Layer Normalization · Byte Pair Encoding
