Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency

Svetlana Maslenkova; Clement Christophe; Marco AF Pimentel; Tathagata Raha; Muhammad Umar Salman; Ahmed Al Mahrooqi; Avani Gupta; Shadab Khan; Ronnie Rajan; Praveenkumar Kanithi

arXiv:2510.18556·cs.CL·October 22, 2025

TL;DR

This paper investigates biases in clinical language models, especially regarding opioid prescriptions across demographics, introduces a large curated healthcare dataset, and proposes evaluation methods to enhance trust and fairness in clinical AI.

Contribution

The paper provides an in-depth bias analysis of clinical LLMs, introduces the HC4 dataset, and develops healthcare-specific evaluation methodologies for model fairness.

Findings

01. Identified differential opioid prescription tendencies across demographic groups.

02. Introduced HC4, a large curated healthcare dataset with over 89 billion tokens.

03. Developed healthcare-specific bias evaluation methods.
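
As a rough illustration of what "differential prescription tendencies" can mean in practice, the sketch below compares how often a model's responses recommend an opioid for each demographic group. It is not the paper's evaluation method: the function names (`prescription_rates`, `max_rate_gap`), the group labels, and the toy records are all hypothetical, and it assumes each model response has already been reduced to a yes/no flag.

```python
from collections import defaultdict


def prescription_rates(records):
    """Return {group: fraction of responses that recommended an opioid}.

    `records` is an iterable of (group_label, recommended_opioid) pairs.
    Both the record layout and the labels are assumptions for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [opioid recommendations, total]
    for group, recommended in records:
        counts[group][0] += int(recommended)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def max_rate_gap(rates):
    """Largest difference in recommendation rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: the same clinical vignette posed with different
# demographic attributes, each response reduced to a boolean flag.
toy_records = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = prescription_rates(toy_records)
gap = max_rate_gap(rates)
print(rates)
print(gap)
```

A gap near zero would suggest similar behavior across groups on these prompts. A real analysis would also need many more samples and a significance test before drawing any conclusion.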

Abstract

Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both…

Code & Models

Datasets

m42-health/HC4
dataset · 1.6k dl

Videos

Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling