Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA
Alberto Testoni, Iacer Calixto

TL;DR
This study reveals that social identity markers like sexual orientation and religious affiliation significantly impair the accuracy and confidence calibration of large language models in medical question answering, risking unsafe clinical deployment.
Contribution
It shows how social descriptors of a patient distort LLM accuracy and confidence calibration, highlighting risks to equitable deployment of healthcare AI.
Findings
Identity markers degrade LLM accuracy and calibration; "homosexual" markers consistently trigger performance drops.
Intersectional identities produce non-additive calibration harms.
A clinician-validated case study shows the failures persist in open-ended generation, so they are not an artifact of the multiple-choice format.
Abstract
Safe clinical deployment of Large Language Models (LLMs) requires not only high accuracy but also robust uncertainty calibration to ensure models defer to clinicians when appropriate. Our paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, we demonstrate that identity markers cause a "calibration crisis". "Homosexual" markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. Moreover, a clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. Our results demonstrate that the presence of social…
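To make the notion of calibration concrete, the sketch below computes a binned Expected Calibration Error (ECE) for the same questions with and without an identity marker. This is an illustration only: the abstract does not state which calibration metric the authors use, so ECE is an assumption here, and the function name `expected_calibration_error` and the toy confidences and correctness labels are invented for the example, not results from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width binned ECE: the size-weighted mean of |accuracy - mean confidence| per bin.

    Assumption: ECE is used as a stand-in calibration metric; the paper may use another one.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width confidence bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data: the model reports the same confidence on the original
# questions and on their counterfactual variants, but answers fewer variants correctly.
toy_confidences = [0.95, 0.95, 0.65, 0.65]
toy_correct_original = [True, True, True, False]
toy_correct_counterfactual = [True, False, False, False]

ece_original = expected_calibration_error(toy_confidences, toy_correct_original)
ece_counterfactual = expected_calibration_error(toy_confidences, toy_correct_counterfactual)
print(f"ECE original: {ece_original:.3f}, ECE counterfactual: {ece_counterfactual:.3f}")
```

In this toy setup the reported confidence stays the same while accuracy falls, so the gap between confidence and accuracy widens. That is the kind of miscalibration the abstract describes: a model that remains confident when it should defer to a clinician.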
