Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems

Michael E. Garcia-Alcoser; Mobina GhojoghNejad; Fakrul Islam Tushar; David Kim; Kyle J. Lafata; Geoffrey D. Rubin; Joseph Y. Lo

arXiv:2506.03259 · cs.CL · January 9, 2026

TL;DR

This study evaluates the performance of lightweight large language models in automating disease labeling of CT radiology reports across multiple organ systems, demonstrating their superiority over rule-based methods and their potential for clinical application.

Contribution

It introduces the use of open-weight LLMs for zero-shot multi-disease labeling in CT reports, showing they outperform rule-based algorithms and generalize across organ systems.
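Zero-shot multi-disease labeling can be sketched as building one instruction prompt per report and parsing a structured reply. The sketch below is illustrative only: the prompt wording, the JSON output format, the example finding names, and the helper names `build_zero_shot_prompt` and `parse_labels` are all assumptions, not the prompt or pipeline used in the paper. The model call itself is left out, since it depends on the serving stack.

```python
import json
from typing import Dict, List


def build_zero_shot_prompt(report_text: str, findings: List[str]) -> str:
    """Build a zero-shot prompt (no labeled examples) for one report.

    The wording here is a hypothetical example, not the paper's prompt.
    """
    keys = ", ".join(f'"{name}"' for name in findings)
    return (
        "You are labeling a CT radiology report.\n"
        "For each finding listed, answer 1 if the report states it is "
        "present, otherwise 0.\n"
        f"Return only a JSON object with exactly these keys: {keys}.\n\n"
        f"Report:\n{report_text}\n"
    )


def parse_labels(raw_reply: str, findings: List[str]) -> Dict[str, int]:
    """Extract binary labels from a model reply.

    Assumption: an unparseable reply or a missing key counts as 0 (absent).
    """
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        return {name: 0 for name in findings}
    try:
        data = json.loads(raw_reply[start:end + 1])
    except json.JSONDecodeError:
        return {name: 0 for name in findings}
    if not isinstance(data, dict):
        return {name: 0 for name in findings}
    return {name: 1 if data.get(name) in (1, True, "1") else 0
            for name in findings}


# Hypothetical finding names, used only for illustration.
FINDINGS = ["atelectasis", "pleural effusion"]
prompt = build_zero_shot_prompt("Small right pleural effusion.", FINDINGS)
labels = parse_labels('Answer: {"atelectasis": 0, "pleural effusion": 1}', FINDINGS)
```

Defaulting to 0 on a malformed reply keeps the label matrix complete, at the cost of possibly hiding parsing failures; logging those cases separately would be a reasonable alternative.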

Findings

01

Llama-3.1 8B and Gemma-3 27B achieved the highest agreement on the internal test set (median κ: 0.87).

02

Lightweight LLMs outperformed rule-based algorithms in macro-F1 scores.

03

Models generalized well across different datasets and organ systems.

Abstract

Purpose: This study aims to evaluate the effectiveness of large language models (LLMs) in automating disease annotation of CT radiology reports. We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports. Materials and Methods: This retrospective study analyzed 40,833 CAP CT reports from 29,540 patients, with 1,789 reports manually annotated across three organ systems. External validation was conducted using the CT RATE dataset. Three open-weight LLMs were tested with zero-shot prompting. Performance was evaluated using Cohen's kappa ($\kappa$) and micro- and macro-averaged F1 scores. Results: In the internal test set of 12,197 CAP reports from 8,854 patients, Llama-3.1 8B and Gemma-3 27B showed the highest agreement (median $\kappa$: 0.87). On the manually…
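The evaluation metrics named in the abstract can be illustrated with a minimal, dependency-free sketch for binary labels. The function names and the toy labels below are hypothetical and are not data or code from the study. Micro-averaged F1 pools true positives, false positives, and false negatives across all findings before computing F1, while macro-averaged F1 computes F1 per finding and then takes the unweighted mean.

```python
from typing import Dict, List, Sequence, Tuple


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two equal-length sequences of binary (0/1) labels."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("label sequences must be non-empty and equal length")
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a, rate_b = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0, independently.
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_expected == 1.0:
        return 1.0  # both label sets are constant and identical
    return (p_observed - p_expected) / (1 - p_expected)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined here as 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Dict[str, List[int]],
                   y_pred: Dict[str, List[int]]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) over per-finding binary labels."""
    per_finding = []
    total_tp = total_fp = total_fn = 0
    for name, truth in y_true.items():
        pred = y_pred[name]
        tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
        per_finding.append(f1_from_counts(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    macro = sum(per_finding) / len(per_finding)
    return micro, macro


# Toy labels for four reports and two hypothetical findings.
truth = {"nodule": [1, 1, 0, 0], "effusion": [1, 0, 0, 0]}
pred = {"nodule": [1, 0, 0, 0], "effusion": [1, 1, 0, 0]}
kappa_nodule = cohen_kappa(truth["nodule"], pred["nodule"])
micro_f1, macro_f1 = micro_macro_f1(truth, pred)
```

Macro-averaging gives rare findings the same weight as common ones, which is why it is often reported alongside the micro average when label prevalence varies widely.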

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Radiology practices and education · Radiomics and Machine Learning in Medical Imaging