Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Hamed Fayyaz, Raphael Poulain, Rahmatollah Beheshti

TL;DR
This paper introduces a scalable, automated method for evaluating bias in medical language models by generating test cases from medical knowledge, enabling more comprehensive bias detection in high-stakes medical AI applications.
Contribution
The authors develop a novel pipeline that leverages medical knowledge graphs and ontologies to automatically generate bias test cases, addressing limitations of manual dataset creation in medical bias evaluation.
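To make the idea of a bias test case concrete, here is a minimal sketch of a counterfactual vignette comparison. It is not the authors' pipeline: the template string, the function names (`make_vignettes`, `yes_rate`, `max_disparity`), and the `toy_model` stand-in are all hypothetical, and the paper generates vignettes from biomedical evidence rather than from a fixed template.

```python
# Hypothetical fixed template; an assumption for illustration only.
TEMPLATE = (
    "A {age}-year-old {attribute} patient presents with {finding}. "
    "Should the clinician recommend {intervention}? Answer yes or no."
)


def make_vignettes(attributes, age, finding, intervention):
    """Build one vignette per sensitive-attribute value; all other details are identical."""
    return {
        a: TEMPLATE.format(age=age, attribute=a, finding=finding, intervention=intervention)
        for a in attributes
    }


def yes_rate(model, prompt, n_samples=5):
    """Fraction of sampled answers that start with 'yes'."""
    answers = [model(prompt).strip().lower() for _ in range(n_samples)]
    return sum(a.startswith("yes") for a in answers) / n_samples


def max_disparity(model, vignettes, n_samples=5):
    """Largest gap in yes-rate across attribute values, plus the per-value rates."""
    rates = {a: yes_rate(model, p, n_samples) for a, p in vignettes.items()}
    return max(rates.values()) - min(rates.values()), rates


def toy_model(prompt):
    """Stand-in for an LLM call; deliberately biased so the check has something to find."""
    return "no" if " female " in prompt else "yes"


vignettes = make_vignettes(
    ["male", "female"], age=45, finding="a BMI of 36",
    intervention="a bariatric surgery referral",
)
gap, rates = max_disparity(toy_model, vignettes)
print(gap, rates)
```

A large gap between attribute values on otherwise identical vignettes is the kind of signal such an evaluation looks for. In practice `toy_model` would be replaced by a call to the model under test.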
Findings
Generated test cases effectively reveal bias patterns in medical LLMs.
The approach scales bias evaluation to larger, more diverse medical scenarios.
The authors release a large bias evaluation dataset for medical case studies.
Abstract
Large language models (LLMs) have shown impressive potential in helping with numerous medical challenges. Deploying LLMs in high-stakes applications such as medicine, however, brings in many concerns. One major area of concern relates to biased behaviors of LLMs in medical applications, leading to unfair treatment of individuals. To pave the way for the responsible and impactful deployment of Med LLMs, rigorous evaluation is a key prerequisite. Due to the huge complexity and variability of different medical scenarios, existing work in this domain has primarily relied on using manually crafted datasets for bias evaluation. In this study, we present a new method to scale up such bias evaluations by automatically generating test cases based on rigorous medical evidence. We specifically target the challenges of a) domain-specificity of bias characterization, b) hallucinating while…
Peer Reviews
Decision · Submitted to ICLR 2025
- The method allows for large-scale evaluation, generating a variety of vignettes quickly without human effort
- Retrieval of relevant literature and entities should help reduce hallucinations
- Additional hallucination reduction using recent techniques (RefChecker and G-Eval) is encouraging
- The method may still struggle with the nuanced complexities and biases of clinical cases, e.g., certain races having a higher prevalence of certain symptoms (such as Asian flush)
- The method could inadvertently reinforce existing biases present in biomedical literature. Since it relies on pre-existing knowledge bases, any systemic bias in these resources may propagate through the generated vignettes, leading to an inherent limitation in the bias evaluation framework.
- The human evaluation s
The paper presents a new tool for automatically generating clinical vignettes to evaluate bias in medical LLMs, addressing a key challenge in the field. The proposed pipeline incorporates multiple components to ensure the generated vignettes are evidence-based, domain-specific, and reduce hallucinations. The use of biomedical knowledge bases and ontologies helps ground the vignettes in established medical evidence and relationships. The method includes checks for outcome independence and halluci
Limited Scope of Case Studies: While obesity is a good starting point, the paper would be stronger with more diverse medical conditions to demonstrate generalizability. The current focus on a single primary case study leaves questions about how well the method extends to other medical domains.
Validation of Medical Accuracy: While the paper uses various computational metrics for evaluation, there's limited validation of the medical accuracy of generated vignettes by practicing clinicians. The h
* The limited scalability of data for red teaming and adversarial testing is a critical and unsolved problem. The core motivation for the work to generate vignettes for red teaming through grounding in the clinical literature and evidence for the presence of associations and/or disparities across patient groups is a strong one.
* The approach to grounding and hallucination detection is reasonable and well-executed.
* Clarity and validity of methods related to the outcome independence check
  * I found the description of the outcome independence check (Section 3.4) difficult to follow, even with personal experience working with UMLS, such that it is difficult to assess whether the proposed approach is reasonable. I believe the ambiguity comes from the language, “we especially extract a subset (S_Anc) that belongs to the specified sensitive attributes (such as specific gender or ethnicity)”. Here, it is not
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Semantic Web and Ontologies · Data Mining Algorithms and Applications
