Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes

Isotta Landi; Eugenia Alleva; Nicole Bussola; Rebecca M. Cohen; Sarah Nowlin; Leslee J. Shaw; Alexander W. Charney; Kimberly B. Glazer

arXiv:2603.10004 · cs.CL · March 12, 2026

Open Access

TL;DR

This study fine-tunes language models to detect biased language in clinical notes. Fine-tuning outperforms prompting, and the results point to a need for specialty-specific adaptation to keep predictions accurate and clinically relevant.

Contribution

The paper introduces a lexicon-based framework and shows that fine-tuning on lexically primed inputs detects biased language in clinical text better than prompting does.

Findings

01. Fine-tuning outperforms prompting in bias classification.

02. GatorTron achieves 0.96 F1 score on OB-GYN data.

03. Cross-domain generalizability is limited without domain-specific training.

Abstract

Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting and classifying such language as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches.…
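The lexicon-based matching step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the lexicon entries and their valence scores, the word-window chunking rule, the `window` parameter, and the `prime_input` format, which is only one guess at what a "lexically primed" input might look like.

```python
import re
from typing import NamedTuple

# Hypothetical lexicon: the terms and valence scores below are illustrative
# placeholders, NOT entries from the paper's curated lexicon.
LEXICON = {
    "noncompliant": -1.0,
    "refused": -0.5,
    "pleasant": 0.5,
}


class Chunk(NamedTuple):
    term: str       # matched lexicon term (normalized)
    valence: float  # score looked up in the lexicon
    text: str       # words surrounding the match


def extract_chunks(note: str, lexicon: dict[str, float], window: int = 5) -> list[Chunk]:
    """Return one chunk per lexicon match: the matched word plus up to
    `window` whitespace-separated words on each side (assumed chunking rule)."""
    words = note.split()
    chunks = []
    for i, word in enumerate(words):
        # Normalize: drop punctuation and lowercase before the lookup.
        key = re.sub(r"\W+", "", word).lower()
        if key in lexicon:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            chunks.append(Chunk(key, lexicon[key], " ".join(words[start:end])))
    return chunks


def prime_input(chunk: Chunk) -> str:
    """Prefix the chunk with its matched term. This format is an assumption,
    not the paper's documented priming scheme."""
    return f"{chunk.term}: {chunk.text}"
```

For example, calling `extract_chunks(note, LEXICON, window=2)` on a short note returns one `Chunk` per matched term, and `prime_input` turns each chunk into a single string that could be passed to a classifier. A real pipeline would likely need multi-word term matching and sentence-aware chunk boundaries, which this sketch omits.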

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education