Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects
Amirhossein Haji Mohammad Rezaei, Zahra Shakeri

TL;DR
This study demonstrates that cultural cues embedded in medical questions can significantly reduce the accuracy of large language models, especially when identifiers and contextual cues co-occur, highlighting challenges in equitable healthcare AI.
Contribution
The paper introduces a counterfactual benchmark with culturally varied test items to evaluate and reveal the impact of cultural cues on medical LLM accuracy, providing tools for assessment and mitigation.
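The counts reported in the abstract (150 items, 1650 variants, three groups, three cue types, one neutral control) can be sanity-checked with a short sketch. The condition labels and the `variant_conditions` helper below are hypothetical names, and counting the unmodified original item as the eleventh variant is an assumption: it is one reading that matches 1650 / 150 = 11, not something the summary states.

```python
from itertools import product

# Groups and cue types as named in the abstract; the string labels are illustrative.
GROUPS = ["Indigenous Canadian", "Middle-Eastern Muslim", "Southeast Asian"]
CUE_TYPES = ["identifier", "context", "identifier+context"]


def variant_conditions(include_original=True):
    """Return the (group, cue_type) conditions applied to each base item."""
    # 3 groups x 3 cue types = 9 culturally cued conditions.
    conditions = list(product(GROUPS, CUE_TYPES))
    # One length-matched neutral control with no cultural group attached.
    conditions.append((None, "neutral_control"))
    # ASSUMPTION: the unmodified item is also counted as a variant.
    if include_original:
        conditions.append((None, "original"))
    return conditions


N_ITEMS = 150  # MedQA test items, per the abstract
conditions = variant_conditions()
print(len(conditions), N_ITEMS * len(conditions))  # 11 1650
```

Under this reading, every base item contributes one row of 11 paired outcomes, which is the layout a paired test across conditions needs.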
Findings
Cultural cues significantly decrease model accuracy (p<10^-14).
Largest accuracy drops occur when identifiers and context co-occur.
Over half of culturally grounded explanations lead to incorrect answers.
Abstract
Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control; a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran's Q, p < 10^-14), with the largest degradation when identifier and context co-occur (up to 3-7 percentage points under…
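For readers unfamiliar with the test named in the abstract, here is a minimal sketch of the standard Cochran's Q statistic for paired binary outcomes. It is not the authors' analysis code: the `cochrans_q` function name and the toy correctness matrix are invented for illustration, and treating each base item as a row and each variant condition as a column is an assumption about the data layout.

```python
def cochrans_q(outcomes):
    """Cochran's Q for an n-by-k matrix of 0/1 outcomes.

    Each row is one item; each column is one condition (for example,
    a cue variant). Returns (Q, degrees_of_freedom).
    """
    k = len(outcomes[0])
    if any(len(row) != k for row in outcomes):
        raise ValueError("every row must have the same number of conditions")

    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    total = sum(row_totals)

    denom = k * total - sum(r * r for r in row_totals)
    if denom == 0:
        # Happens when every item is all-correct or all-wrong across conditions.
        raise ValueError("Q is undefined: no item varies across conditions")

    q = (k - 1) * (k * sum(c * c for c in col_totals) - total * total) / denom
    return q, k - 1


# Toy data (NOT from the paper): 5 items x 3 conditions, 1 = correct answer.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
q, df = cochrans_q(toy)
print(q, df)  # 4.5 2
```

Under the null hypothesis that all conditions have the same accuracy, Q is compared against a chi-square distribution with k - 1 degrees of freedom, which is where a p-value such as the one reported would come from.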
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare
