# Harmonizing organ-at-risk structure names using open-source large language models

**Authors:** Adrian Thummerer, Matteo Maspero, Erik van der Bijl, Stefanie Corradini, Claus Belka, Guillaume Landry, Christopher Kurz

PMC · DOI: 10.1016/j.phro.2025.100813 · Physics and Imaging in Radiation Oncology · 2025-07-24

## TL;DR

This paper shows that open-source large language models can accurately rename radiotherapy organ-at-risk structures according to the AAPM TG-263 guideline. The best model, DeepSeek R1, reached 98.6% accuracy on unique structure names and 99.9% on the full dataset.

## Contribution

The study demonstrates the effectiveness of open-source large language models in harmonizing organ-at-risk nomenclature across multilingual and multi-institutional datasets.

## Key findings

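- DeepSeek R1 achieved 98.6% accuracy on the unique dataset (duplicates removed) and 99.9% on the full dataset.
- Reasoning-enhanced models outperformed their non-reasoning counterparts in nomenclature harmonization.
- Monte Carlo sampling-based uncertainty detected prediction errors better than prompt-based confidence (correlation with errors of 0.70 for DeepSeek R1, versus below 0.42 for prompt-based confidence).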

## Abstract

- Large language models can rename structures according to the AAPM TG-263 guideline.
- Four open-source models were investigated using a multilingual, multi-center dataset.
- DeepSeek R1 achieved the highest accuracy, with 98.6% of structures correctly renamed.
- Monte Carlo uncertainty correlated with prediction errors (r = 0.7).

Standardized radiotherapy structure nomenclature is crucial for automation, inter-institutional collaborations, and large-scale deep learning studies in radiation oncology. Although a nomenclature guideline is available (AAPM TG-263), its adoption in clinical practice remains limited and faces practical challenges. This study evaluated open-source large language models (LLMs) for automated organ-at-risk (OAR) renaming on a multi-institutional and multilingual dataset.

Four open-source LLMs (Llama 3.3, Llama 3.3 R1, DeepSeek V3, DeepSeek R1) were evaluated using a dataset of 34,177 OAR structures from 1684 patients collected at three university medical centers with manual TG-263 ground-truth labels. LLM renaming was performed using a few-shot prompting technique, including detailed instructions and generic examples. Performance was assessed by calculating renaming accuracy on the entire dataset and a unique dataset (duplicates removed). In addition, we performed a failure analysis, prompt-based confidence correlation, and Monte Carlo sampling-based uncertainty estimation.
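The difference between accuracy on the entire dataset and on the unique dataset can be illustrated with a minimal sketch. It assumes that "duplicates removed" means keeping one record per distinct pair of original name and ground-truth label, which is only one plausible reading of the abstract. The `Record` layout, the helper names `accuracy` and `deduplicate`, and the toy data are all hypothetical and not taken from the study.

```python
from typing import Iterable, List, Tuple

# (original structure name, predicted TG-263 name, ground-truth TG-263 name)
Record = Tuple[str, str, str]


def accuracy(records: List[Record]) -> float:
    """Fraction of records whose predicted name equals the ground truth exactly."""
    if not records:
        raise ValueError("no records to score")
    return sum(pred == truth for _, pred, truth in records) / len(records)


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each distinct (original name, ground truth) pair.

    Assumption: this is one plausible reading of "duplicates removed".
    """
    seen = set()
    unique = []
    for original, pred, truth in records:
        key = (original, truth)
        if key not in seen:
            seen.add(key)
            unique.append((original, pred, truth))
    return unique


# Invented toy data, not taken from the study.
records = [
    ("Herz", "Heart", "Heart"),
    ("Herz", "Heart", "Heart"),
    ("Herz", "Heart", "Heart"),
    ("Lunge li", "Lung_L", "Lung_L"),
    ("RM", "Rectum", "SpinalCord"),  # wrong prediction
]

overall_accuracy = accuracy(records)              # 4 of 5 records correct
unique_accuracy = accuracy(deduplicate(records))  # 2 of 3 distinct names correct
```

In this toy case the frequent, easy name dominates the full dataset, so overall accuracy (0.8) exceeds unique accuracy (about 0.67). The same effect would explain why a unique-dataset figure can sit below the overall figure.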

High renaming accuracy was achieved, with the reasoning-enhanced DeepSeek R1 model performing best (98.6% unique accuracy, 99.9% overall accuracy). Overall, reasoning models outperformed their non-reasoning counterparts. Compared with prompt-based confidence estimation (correlation coefficient < 0.42), Monte Carlo sampling showed a stronger correlation with prediction errors (correlation coefficient of 0.70 for DeepSeek R1) and better error detection (sensitivity 0.73, specificity 1.0 for DeepSeek R1).
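One way to picture sampling-based uncertainty and the sensitivity and specificity figures is the sketch below. It assumes that each structure name is sent to the model several times with stochastic decoding, that uncertainty is the share of samples disagreeing with the most frequent answer, and that a prediction is flagged for review when uncertainty exceeds a threshold. The abstract does not specify the authors' exact uncertainty measure or threshold, so `mc_uncertainty`, `THRESHOLD`, and the hard-coded samples (standing in for real model calls) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

THRESHOLD = 0.2  # hypothetical cut-off for flagging a prediction for review


def mc_uncertainty(samples: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent sampled name and the share of samples that disagree with it."""
    if not samples:
        raise ValueError("need at least one sample")
    name, count = Counter(samples).most_common(1)[0]
    return name, 1.0 - count / len(samples)


def sensitivity_specificity(
    is_error: Sequence[bool], is_flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Sensitivity: share of errors that were flagged. Specificity: share of correct predictions left unflagged."""
    tp = sum(e and f for e, f in zip(is_error, is_flagged))
    fn = sum(e and not f for e, f in zip(is_error, is_flagged))
    tn = sum(not e and not f for e, f in zip(is_error, is_flagged))
    fp = sum(not e and f for e, f in zip(is_error, is_flagged))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one error and one correct prediction")
    return tp / (tp + fn), tn / (tn + fp)


# Invented samples standing in for repeated stochastic model calls.
sampled: Dict[str, List[str]] = {
    "Herz": ["Heart"] * 5,
    "Chiasma": ["OpticChiasm"] * 5,
    "RM": ["SpinalCord", "Rectum", "SpinalCord", "Rectum", "Rectum"],
    "Parotis re": ["Parotid_L"] * 5,  # consistently wrong, so it goes undetected
}
truth = {
    "Herz": "Heart",
    "Chiasma": "OpticChiasm",
    "RM": "SpinalCord",
    "Parotis re": "Parotid_R",
}

is_error: List[bool] = []
is_flagged: List[bool] = []
for original, samples in sampled.items():
    prediction, uncertainty = mc_uncertainty(samples)
    is_error.append(prediction != truth[original])
    is_flagged.append(uncertainty > THRESHOLD)

sensitivity, specificity = sensitivity_specificity(is_error, is_flagged)
```

In this toy example one of the two errors is flagged and no correct prediction is, giving a sensitivity of 0.5 and a specificity of 1.0. The consistently wrong case shows why sampling-based flags can miss errors that the model repeats with high agreement.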

Open-source LLMs, particularly those with reasoning capabilities, can accurately harmonize OAR nomenclature according to TG-263 across diverse multilingual and multi-institutional datasets. They can also facilitate TG-263 nomenclature adoption and the creation of large, standardized datasets for research and AI development.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12336799/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12336799/full.md

## References

21 references — full list in the complete paper: https://tomesphere.com/paper/PMC12336799/full.md

---
Source: https://tomesphere.com/paper/PMC12336799