TL;DR
This study evaluates racial bias in five widely used LLMs across two clinical tasks, synthetic patient-case generation and differential diagnosis ranking. It finds that agentic workflows can reduce explicit bias and argues that bias in healthcare AI should be assessed with multiple metrics.
Contribution
It introduces a structured evaluation framework for racial bias in medical LLMs and demonstrates that agentic workflows can mitigate some biases in diagnostic tasks.
Findings
In synthetic patient-case generation, all models deviated from observed U.S. racial distributions; GPT-4.1 showed the smallest overall deviation.
DeepSeek V3 achieved the best overall results in differential diagnosis ranking.
Agentic workflows improved bias metrics, though not uniformly across all metrics.
Abstract
Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task,…
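The comparison between generated cases and race-stratified benchmark distributions can be illustrated with a minimal sketch. It assumes total variation distance as the deviation measure, which is a stand-in chosen for illustration; the abstract does not state which metric the authors use. The function names, group labels, and all numbers are hypothetical and not taken from the paper.

```python
from collections import Counter


def race_distribution(labels):
    """Convert a list of race labels from generated cases into proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {race: n / total for race, n in counts.items()}


def total_variation_distance(observed, reference):
    """Half the L1 distance between two distributions over the same categories.

    Assumption: this is an illustrative deviation measure, not necessarily
    the one used in the study.
    """
    categories = set(observed) | set(reference)
    return 0.5 * sum(
        abs(observed.get(c, 0.0) - reference.get(c, 0.0)) for c in categories
    )


# Hypothetical benchmark proportions for one condition (illustration only).
reference = {"Group A": 0.5, "Group B": 0.3, "Group C": 0.2}

# Hypothetical race labels extracted from 100 model-generated patient cases.
generated_labels = ["Group A"] * 70 + ["Group B"] * 20 + ["Group C"] * 10

observed = race_distribution(generated_labels)
deviation = total_variation_distance(observed, reference)
print(f"Deviation from benchmark: {deviation:.2f}")
```

A value of 0 would mean the generated cases match the benchmark proportions exactly, and larger values mean the model over- or under-represents some groups. Computing this per model and per condition gives one way to rank models by overall deviation.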
