Diagnosing our datasets: How does my language model learn clinical information?

Furong Jia; David Sontag; Monica Agrawal

arXiv:2505.15024·cs.CL·May 26, 2025

TL;DR

This study investigates how open-source large language models learn clinical information by analyzing how they interpret clinical jargon and how they respond to unsupported medical claims. It relates both behaviors to the models' pretraining corpora, revealing data mismatches and the influence of specific sources.

Contribution

The paper introduces a new dataset, MedLingo, and provides insights into the relationship between pretraining data and clinical language understanding in LLMs.

Findings

01. The frequency of clinical jargon in pretraining corpora correlates with model performance.

02. Clinical jargon is often underrepresented in pretraining corpora.

03. Models can parrot unsupported medical claims from online sources.

Abstract

Large language models (LLMs) have performed well across various clinical natural language processing tasks, despite not being directly trained on electronic health record (EHR) data. In this work, we examine how popular open-source LLMs learn clinical information from large mined corpora through two crucial but understudied lenses: (1) their interpretation of clinical jargon, a foundational ability for understanding real-world clinical notes, and (2) their responses to unsupported medical claims. For both use cases, we investigate the frequency of relevant clinical information in their corresponding pretraining corpora, the relationship between pretraining data composition and model outputs, and the sources underlying this data. To isolate clinical jargon understanding, we evaluate LLMs on a new dataset MedLingo. Unsurprisingly, we find that the frequency of clinical jargon mentions…
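
The frequency analysis described in the abstract can be illustrated with a small Python sketch: count how often each jargon term appears in a corpus, then check whether those counts rank-correlate with per-term model accuracy. This is not the authors' pipeline. The function names, the toy documents, the example abbreviations, and the accuracy values are all hypothetical and chosen only for illustration, and Spearman rank correlation is an assumed choice of statistic rather than one confirmed by the text above.

```python
import math
import re


def count_mentions(terms, documents):
    """Count case-insensitive, whole-token mentions of each term across documents."""
    counts = {}
    for term in terms:
        # Lookarounds keep "prn" from matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(doc)) for doc in documents)
    return counts


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical toy corpus and hypothetical per-term accuracies.
    documents = [
        "Take ibuprofen PRN for pain. PRN dosing is common.",
        "Patient is NPO after midnight.",
        "prn use only",
    ]
    accuracy = {"prn": 0.9, "npo": 0.6, "qhs": 0.2}

    terms = list(accuracy)
    counts = count_mentions(terms, documents)
    rho = spearman([counts[t] for t in terms], [accuracy[t] for t in terms])
    print(counts)
    print(f"Spearman rho = {rho:.3f}")
```

On a real pretraining corpus, exact string matching of this kind would need care with ambiguous abbreviations, since the same short string can carry a non-clinical meaning.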

Code & Models

Repositories

Flora-jia-jfr/diagnosing_our_datasets
PyTorch · Official
