# Evaluation of large language models in assigning PI-RADS v2.1 categories for prostate MRI reports

**Authors:** Betul Akdal Dolek, Muhammed Said Besler

PMC · DOI: 10.1186/s12894-025-02038-5 · BMC Urology · 2026-01-02

## TL;DR

This study tested how well four large language models assign PI-RADS v2.1 categories to prostate MRI reports. GPT-o1 performed best, but all models struggled with intermediate-risk (PI-RADS 3) cases.

## Contribution

The study evaluates four LLMs (GPT-4o, GPT-o1, Gemini 1.5 Pro and Gemini 2.0 Experimental Advanced) on PI-RADS v2.1 classification of prostate MRI reports and compares their output against a two-radiologist consensus reference standard.

## Key findings

- GPT-o1 achieved the highest agreement with radiologists (κ = 0.867) and the best F1 scores for the low-risk (0.93) and high-risk (1.00) groups.
- All models showed weak performance for PI-RADS 3 (intermediate-risk) lesions (F1 range: 0.54–0.75).
- No model produced invalid PI-RADS scores outside the 1–5 range.

## Abstract

This study aimed to evaluate the performance of large language models (LLMs) in classifying prostate MRI reports according to the Prostate Imaging–Reporting and Data System (PI-RADS) version 2.1, and to validate their use in supporting clinical decisions in prostate cancer treatment.

This retrospective study included 146 patients. Four LLMs — GPT-4o, GPT-o1, Google Gemini 1.5 Pro and Google Gemini 2.0 Experimental Advanced — were tested on standardised, structured prostate MRI reports. A two-radiologist consensus reference standard was used to compare model performance. Agreement was measured using weighted Cohen’s kappa, and accuracy and F1 scores were calculated for three PI-RADS risk groups: low (1–2), intermediate (3) and high (4–5).
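The metrics named above can be illustrated with a minimal sketch. It assumes linear disagreement weights for the kappa (the summary does not say whether linear or quadratic weights were used) and a one-vs-rest F1 per risk group; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def risk_group(score: int) -> str:
    """Map a PI-RADS category to the risk groups in the text: 1-2, 3, 4-5."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"invalid PI-RADS category: {score}")
    if score <= 2:
        return "low"
    if score == 3:
        return "intermediate"
    return "high"


def weighted_kappa(ref, pred, categories=(1, 2, 3, 4, 5), power=1):
    """Weighted Cohen's kappa. power=1 is linear weighting (an assumption
    here); power=2 would give quadratic weighting."""
    n = len(ref)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    observed = Counter((idx[r], idx[p]) for r, p in zip(ref, pred))
    ref_marg = Counter(idx[r] for r in ref)
    pred_marg = Counter(idx[p] for p in pred)
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            num += w * observed[(i, j)] / n
            den += w * ref_marg[i] * pred_marg[j] / (n * n)
    if den == 0:  # both raters used one identical category throughout
        return 1.0
    return 1.0 - num / den


def f1_for_group(ref, pred, group):
    """One-vs-rest F1 for a single risk group."""
    ref_g = [risk_group(r) == group for r in ref]
    pred_g = [risk_group(p) == group for p in pred]
    tp = sum(r and p for r, p in zip(ref_g, pred_g))
    fp = sum((not r) and p for r, p in zip(ref_g, pred_g))
    fn = sum(r and (not p) for r, p in zip(ref_g, pred_g))
    total = 2 * tp + fp + fn
    return 2 * tp / total if total else 0.0


# Toy data, invented for illustration only (not study data).
consensus = [1, 2, 3, 4, 5, 3]
model_out = [1, 2, 4, 4, 5, 3]
kappa = weighted_kappa(consensus, model_out)
f1_scores = {g: f1_for_group(consensus, model_out, g)
             for g in ("low", "intermediate", "high")}
```

In this toy case a single PI-RADS 3 report upgraded to 4 lowers the intermediate-group F1 far more than it lowers the weighted kappa, because the kappa penalises a one-step disagreement only lightly.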

Performance varied by model. GPT-o1 achieved the highest level of agreement with radiologists (κ = 0.867), followed by GPT-4o (κ = 0.743), Gemini 1.5 Pro (κ = 0.728) and Gemini 2.0 Experimental Advanced (κ = 0.664). GPT-o1 achieved the highest F1 scores for the low-risk (0.93) and high-risk (1.00) groups, demonstrating moderate performance for the PI-RADS 3 group (0.75). All models showed weak performance for PI-RADS 3 (F1 range: 0.54–0.75). Most importantly, none of the models produced invalid results outside the target PI-RADS 1–5 range.
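A check for outputs outside the PI-RADS 1–5 range could look like the sketch below. The summary does not describe how model replies were parsed, so the reply format, the regular expression and the helper names `parse_pirads` and `invalid_rate` are all assumptions made for illustration.

```python
import re
from typing import Optional

VALID_CATEGORIES = {1, 2, 3, 4, 5}

# Assumed reply shapes such as "PI-RADS 4" or "PIRADS category: 3".
# Deliberately simple: other phrasings would be counted as invalid.
_PATTERN = re.compile(
    r"PI-?RADS\s*(?:category|score)?\s*[:=]?\s*(\d+)", re.IGNORECASE
)


def parse_pirads(response: str) -> Optional[int]:
    """Return the PI-RADS category in a reply, or None if absent or out of range."""
    match = _PATTERN.search(response)
    if match is None:
        return None
    value = int(match.group(1))
    return value if value in VALID_CATEGORIES else None


def invalid_rate(responses) -> float:
    """Fraction of replies that yield no valid PI-RADS category."""
    responses = list(responses)
    if not responses:
        return 0.0
    invalid = sum(parse_pirads(r) is None for r in responses)
    return invalid / len(responses)
```

Under these assumptions, a reply such as `"PI-RADS 6"` is treated the same as a reply with no score at all, so an `invalid_rate` of zero corresponds to every reply carrying a category between 1 and 5.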

LLMs show potential for automating PI-RADS classification from MRI reports, with GPT-o1 demonstrating the best overall performance. However, their weak performance on PI-RADS 3 lesions indicates that multicentre validation, larger datasets and multimodality integration are needed before they can be used clinically for prostate cancer diagnosis and urological decision-making.

Clinical trial number: not applicable. This retrospective study did not involve a clinical trial.

## Linked entities

- **Diseases:** prostate cancer (MONDO:0005159)

## Full-text entities

- **Genes:** AZIN2 (antizyme inhibitor 2) [NCBI Gene 113451] {aka ADC, AZIB1, ODC-p, ODC1L, ODCp}, KLK3 (kallikrein related peptidase 3) [NCBI Gene 354] {aka APS, KLK2A1, PSA, hK3}
- **Diseases:** death (MESH:D003643), Prostate cancer (MESH:D011471), anxiety (MESH:D001007), PI-RADS (MESH:D011472), benign prostatic hyperplasia (MESH:D011470), LLMs (MESH:D007806), hallucinations (MESH:D006212), cancer (MESH:D009369), lung cancer (MESH:D008175)
- **Chemicals:** PI (MESH:D010716), Bard (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12866120/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12866120/full.md

## References

3 references — full list in the complete paper: https://tomesphere.com/paper/PMC12866120/full.md

---
Source: https://tomesphere.com/paper/PMC12866120