FairMedQA: Benchmarking Bias in Large Language Models for Medical Question Answering

Ying Xiao; Jie Huang; Ruijuan He; Jing Xiao; Mohammad Reza Mousavi; Yepang Liu; Kezhi Li; Zhenpeng Chen; Jie M. Zhang

arXiv:2505.19562 · cs.AI · January 13, 2026

TL;DR

This paper introduces FairMedQA, a new benchmark for measuring biases in large language models used for medical question answering, revealing significant disparities and exposing limitations of existing benchmarks.

Contribution

The paper presents FairMedQA, a benchmark of 4,806 counterfactual question pairs built from 801 clinical vignettes, uses it to evaluate bias in 12 LLMs, and shows that it is more sensitive to bias than previous benchmarks.

Findings

01

Substantial accuracy disparities across sensitive demographic groups (3 to 19 percentage points)

02

FairMedQA detects biases at least 12 percentage points larger than previous benchmarks

03

Highlights the urgent need for debiasing and validation in clinical LLM applications

Abstract

Large language models (LLMs) are approaching expert-level performance in medical question answering (QA), demonstrating strong potential to improve public healthcare. However, underlying biases related to sensitive attributes such as sex and race pose life-critical risks. The extent to which such sensitive attributes affect diagnosis remains an open question and requires comprehensive empirical investigation. Additionally, even the latest Counterfactual Patient Variations (CPV) benchmark can hardly distinguish the bias levels of different LLMs. To further explore these dynamics, we propose a new benchmark, FairMedQA, and benchmark 12 representative LLMs. FairMedQA contains 4,806 counterfactual question pairs constructed from 801 clinical vignettes. Our results reveal substantial accuracy disparity ranging from 3 to 19 percentage points across sensitive demographic groups. Notably,…
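As a rough illustration of what an "accuracy disparity in percentage points" means, the sketch below computes per-group accuracy over a set of answered questions and reports the gap between the best- and worst-served groups. The record layout, the group names, and the max-minus-min definition are assumptions made for this example; the paper's exact metric may differ.

```python
from collections import defaultdict


def group_accuracy(records):
    """Return accuracy per demographic group as a fraction in [0, 1].

    `records` is an iterable of (group, is_correct) tuples; this layout
    is hypothetical and not taken from the FairMedQA data format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, is_correct in records:
        total[group] += 1
        correct[group] += int(is_correct)
    return {g: correct[g] / total[g] for g in total}


def disparity_pp(records):
    """Gap between the highest and lowest group accuracy, in percentage points.

    Assumed definition (max minus min); not necessarily the paper's metric.
    """
    acc = group_accuracy(records)
    return 100.0 * (max(acc.values()) - min(acc.values()))


# Hypothetical toy data: the same questions posed with two counterfactual
# patient variants, and whether the model answered each one correctly.
records = [
    ("variant_a", True), ("variant_a", True),
    ("variant_a", True), ("variant_a", False),
    ("variant_b", True), ("variant_b", True),
    ("variant_b", False), ("variant_b", False),
]

print(group_accuracy(records))  # {'variant_a': 0.75, 'variant_b': 0.5}
print(disparity_pp(records))    # 25.0
```

In this toy case the accuracies are 75% and 50%, so the disparity is 25 percentage points, which is a different quantity from a 25% relative change.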

Code & Models

Repositories

xy-showing/amqa (official)

Taxonomy

Topics: Quality and Management Systems · Quality and Safety in Healthcare

Methods: Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Attention Is All You Need · Layer Normalization · Byte Pair Encoding