Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

Amirhossein Haji Mohammad Rezaei; Zahra Shakeri

arXiv:2601.20102·cs.CL·January 29, 2026

TL;DR

This study demonstrates that cultural cues embedded in medical questions can significantly reduce the accuracy of large language models, especially when identifiers and contextual cues co-occur, highlighting challenges in equitable healthcare AI.

Contribution

The paper introduces a counterfactual benchmark with culturally varied test items to evaluate and reveal the impact of cultural cues on medical LLM accuracy, providing tools for assessment and mitigation.

Findings

1. Cultural cues significantly decrease model accuracy ($p < 10^{-14}$).
2. Largest accuracy drops occur when identifiers and context co-occur.
3. Over half of culturally grounded explanations lead to incorrect answers.

Abstract

Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control, where a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran's Q, $p < 10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3-7 percentage points under…
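The abstract names Cochran's Q as the significance test. It compares a binary outcome (answer correct or not) across several matched conditions for the same items. The following is a minimal sketch of the textbook statistic, not the authors' analysis code. The function name `cochrans_q` and the toy correctness matrix are hypothetical, and the three columns stand in for any set of matched prompt variants.

```python
from typing import List, Tuple


def cochrans_q(outcomes: List[List[int]]) -> Tuple[float, int]:
    """Return (Q, degrees_of_freedom) for a matrix of 0/1 outcomes.

    Rows are matched blocks (here: question items); columns are
    conditions (here: prompt variants). Hypothetical helper name.
    """
    k = len(outcomes[0])  # number of conditions
    if k < 2 or any(len(row) != k for row in outcomes):
        raise ValueError("need a rectangular matrix with at least 2 columns")

    row_totals = [sum(row) for row in outcomes]
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    grand_total = sum(row_totals)

    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Every item is all-correct or all-wrong: no within-item variation.
        raise ValueError("Q is undefined when no row varies across conditions")

    return numerator / denominator, k - 1


# Toy data (illustrative only, not from the paper): 6 items x 3 conditions,
# 1 = model answered correctly, 0 = model answered incorrectly.
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
q_stat, dof = cochrans_q(toy)
print(q_stat, dof)
```

Under the null hypothesis that all conditions have the same accuracy, Q is approximately chi-square distributed with k - 1 degrees of freedom, so a p-value can be read from that distribution. Items answered identically in every condition contribute nothing to Q, which is why the test suits counterfactual designs in which only the inserted cue changes.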

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare