# AI-Powered Detection of Inappropriate Language in Medical School Curricula

**Authors:** Chiman Salavati, Shannon Song, Scott A. Hale, Roberto E. Montenegro, Shiri Dori-Hacohen, Fabricio Murai

arXiv: 2508.19883 · 2025-08-28

## TL;DR

This study evaluates small language models and large language models for detecting inappropriate language in medical educational materials, finding that fine-tuned small models outperform large models with prompt engineering.

## Contribution

It presents a comprehensive evaluation of fine-tuned small language models (SLMs) and prompted large language models (LLMs) for detecting inappropriate use of language (IUL) in medical curricula, showing that fine-tuned SLMs are more effective than prompt-based LLMs.

## Key findings

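- Llama-3 8B and 70B, even with carefully curated shots, are largely outperformed by fine-tuned SLMs.
- The multilabel classifier performs best on annotated data.
- Supplementing training with unflagged excerpts as negative examples boosts the subcategory-specific classifiers' AUC by up to 25%, making them the most effective models overall.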
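
The last finding can be illustrated with a minimal sketch of how a training set for one subcategory-specific binary classifier might be assembled. This is not the authors' code: the `Excerpt` type, the `binary_training_set` helper, and the subcategory names used with it are hypothetical placeholders, and the only idea taken from the paper is that unflagged excerpts are added as negative examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Excerpt:
    """Hypothetical container for one excerpt of curricular text."""
    text: str
    # Subcategory names assigned by annotators; empty for unflagged excerpts.
    labels: frozenset = frozenset()


def binary_training_set(annotated, unflagged, subcategory, use_unflagged=True):
    """Build (text, label) pairs for one subcategory-specific classifier.

    Annotated excerpts are positive (1) if they carry `subcategory`, else
    negative (0). When `use_unflagged` is True, every unflagged excerpt is
    appended as an additional negative example.
    """
    pairs = [(e.text, int(subcategory in e.labels)) for e in annotated]
    if use_unflagged:
        pairs += [(e.text, 0) for e in unflagged]
    return pairs
```

Calling the helper with `use_unflagged=False` reproduces the annotated-only setting, so the two training regimes can be compared on the same held-out data.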

## Abstract

The use of inappropriate language -- such as outdated, exclusionary, or non-patient-centered terms -- in medical instructional materials can significantly influence clinical training, patient interactions, and health outcomes. Despite their reputability, many materials developed over past decades contain examples now considered inappropriate by current medical standards. Given the volume of curricular content, manually identifying instances of inappropriate use of language (IUL) and its subcategories for systematic review is prohibitively costly and impractical. To address this challenge, we conduct a first-in-class evaluation of small language models (SLMs) fine-tuned on labeled data and pre-trained LLMs with in-context learning on a dataset containing approximately 500 documents and over 12,000 pages. For SLMs, we consider: (1) a general IUL classifier, (2) subcategory-specific binary classifiers, (3) a multilabel classifier, and (4) a two-stage hierarchical pipeline for general IUL detection followed by multilabel classification. For LLMs, we consider variations of prompts that include subcategory definitions and/or shots. We found that both Llama-3 8B and 70B, even with carefully curated shots, are largely outperformed by SLMs. While the multilabel classifier performs best on annotated data, supplementing training with unflagged excerpts as negative examples boosts the specific classifiers' AUC by up to 25%, making them the most effective models for mitigating harmful language in medical curricula.
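
The two-stage hierarchical pipeline in option (4) can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the paper's implementation: the function names, the 0.5 thresholds, the subcategory labels, and the toy scorers (which stand in for fine-tuned SLMs) are all hypothetical.

```python
from typing import Callable, Dict


def two_stage_predict(
    text: str,
    general_score: Callable[[str], float],
    multilabel_scores: Callable[[str], Dict[str, float]],
    general_threshold: float = 0.5,  # assumed cutoff, not from the paper
    label_threshold: float = 0.5,    # assumed cutoff, not from the paper
) -> dict:
    """Stage 1 decides whether the text contains IUL at all; stage 2
    assigns subcategories, and runs only on text that stage 1 flagged."""
    if general_score(text) < general_threshold:
        return {"iul": False, "subcategories": []}
    scores = multilabel_scores(text)
    flagged = sorted(name for name, s in scores.items() if s >= label_threshold)
    return {"iul": True, "subcategories": flagged}


# Toy stand-ins for the two fine-tuned models (hypothetical behavior).
def toy_general_score(text: str) -> float:
    return 0.9 if "FLAGGED_TERM" in text else 0.1


def toy_multilabel_scores(text: str) -> Dict[str, float]:
    return {"outdated": 0.2, "exclusionary": 0.1, "non_patient_centered": 0.8}
```

One consequence of this design is that any excerpt missed by the general detector never reaches the multilabel stage, so stage 1 recall bounds the recall of the whole pipeline.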

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/2508.19883/full.md

---
Source: https://tomesphere.com/paper/2508.19883