Impact of Label Noise from Large Language Models Generated Annotations on Evaluation of Diagnostic Model Performance

Mohammadreza Chavoshi; Hari Trivedi; Janice Newsome; Aawez Mansuri; Chiratidzo Rudado Sanyika; Rohan Satya Isaac; Frank Li; Theo Dapamede; Judy Gichoya

arXiv:2506.07273 · stat.ME · April 10, 2026

TL;DR

This study quantifies how label noise from large language models affects the evaluation of diagnostic AI models, revealing prevalence-dependent biases that can misrepresent true performance.

Contribution

The paper introduces a simulation framework for analyzing how LLM label errors affect diagnostic model evaluation, and it highlights the importance of characterizing the LLM's labeling errors.

Findings

01. LLM label noise causes systematic bias in performance estimates.

02. Bias magnitude depends on disease prevalence and label quality.

03. Monte Carlo simulations show consistent downward bias in observed performance.
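
To make the prevalence dependence concrete, here is a sketch of the expected observed sensitivity when LLM labels serve as the reference. It assumes that the model's errors and the LLM's errors are independent given the true disease status; this summary does not state that assumption, and the symbols below are introduced here for illustration only: p is prevalence, Se_L and Sp_L are the LLM's sensitivity and specificity, and Se_M and Sp_M are the model's true sensitivity and specificity.

```latex
\[
Se_{\mathrm{obs}}
  = P(\mathrm{model}{+} \mid \mathrm{LLM}{+})
  = \frac{p \, Se_L \, Se_M + (1-p)(1-Sp_L)(1-Sp_M)}
         {p \, Se_L + (1-p)(1-Sp_L)}
\]

% Illustrative values (not results from the paper): Se_L = Sp_L = Se_M = Sp_M = 0.95
%   p = 0.10:  (0.09025 + 0.00225) / (0.095 + 0.045) = 0.0925 / 0.14 \approx 0.661
%   p = 0.50:  (0.45125 + 0.00125) / (0.475 + 0.025) = 0.4525 / 0.50 = 0.905
```

Under these assumed values, the same model and the same LLM give an observed sensitivity of about 66% at 10% prevalence and about 90% at 50% prevalence, against a true sensitivity of 95%. At low prevalence the LLM's false positives make up a larger share of the reference-positive cases, which is why the bias grows as prevalence falls.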

Abstract

Large language models (LLMs) are increasingly used to generate labels from radiology reports to enable large-scale AI evaluation. However, label noise from LLMs can introduce bias into performance estimates, especially under varying disease prevalence and model quality. This study quantifies how LLM labeling errors impact downstream diagnostic model evaluation. We developed a simulation framework to assess how LLM label errors affect observed model performance. A synthetic dataset of 10,000 cases was generated across different prevalence levels. LLM sensitivity and specificity were varied independently between 90% and 100%. We simulated diagnostic models with true sensitivity and specificity ranging from 90% to 100%. Observed performance was computed using LLM-generated labels as the reference. We derived analytical performance bounds and ran 5,000 Monte Carlo trials per condition to…
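
The simulation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `simulate_observed`, its parameters, and the assumption that LLM and model errors are independent given the true label are all introduced here. The default trial count is far below the 5,000 trials per condition mentioned in the abstract, to keep the example fast.

```python
import numpy as np


def simulate_observed(prev, llm_se, llm_sp, model_se, model_sp,
                      n_cases=10_000, n_trials=200, seed=0):
    """Monte Carlo estimate of observed sensitivity/specificity when
    noisy LLM labels are used as the reference standard.

    Assumption (not from the source): LLM errors and model errors are
    independent given the true disease status.
    """
    rng = np.random.default_rng(seed)
    obs_se = np.empty(n_trials)
    obs_sp = np.empty(n_trials)
    for t in range(n_trials):
        # Ground-truth disease status for each synthetic case.
        truth = rng.random(n_cases) < prev
        # LLM label: positive with prob. llm_se on diseased cases,
        # positive with prob. (1 - llm_sp) on healthy cases.
        llm = np.where(truth,
                       rng.random(n_cases) < llm_se,
                       rng.random(n_cases) >= llm_sp)
        # Diagnostic model prediction, generated the same way.
        model = np.where(truth,
                         rng.random(n_cases) < model_se,
                         rng.random(n_cases) >= model_sp)
        # Score the model against the LLM labels, not the ground truth.
        obs_se[t] = model[llm].mean()
        obs_sp[t] = (~model[~llm]).mean()
    return obs_se.mean(), obs_sp.mean()


# Example: 10% prevalence, LLM and model both at 95% sensitivity/specificity.
se, sp = simulate_observed(0.10, 0.95, 0.95, 0.95, 0.95)
print(f"observed sensitivity: {se:.3f}, observed specificity: {sp:.3f}")
```

Sweeping `prev`, `llm_se`, and `llm_sp` over a grid and comparing the returned values with `model_se` and `model_sp` gives a rough picture of how the gap between observed and true performance changes with prevalence and label quality.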
