Can open source large language models be used for tumor documentation in Germany? -- An evaluation on urological doctors' notes

Stefan Lenz; Arsenij Ustjanzew; Marco Jeray; Meike Ressing; Torsten Panholzer

arXiv:2501.12106·cs.CL·August 8, 2025

TL;DR

This study evaluates open source large language models for automating tumor documentation in German urology notes, finding that models with 7-12 billion parameters perform well and could enhance clinical documentation efficiency.

Contribution

The paper provides a comprehensive evaluation of open source LLMs for tumor documentation tasks in German medical texts, introducing a new dataset and benchmarking results.

Findings

01

Models with 7-12 billion parameters perform comparably well.

02

Larger models do not necessarily outperform smaller ones.

03

Few-shot prompting with cross-domain examples improves performance.

Abstract

Tumor documentation in Germany is largely done manually, which requires reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs, ranging in size from 1 to 70 billion parameters, on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. To evaluate the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably…
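
As an illustration of the few-shot setup the abstract describes, the following minimal sketch builds a prompt for the ICD-10 coding task with a configurable number of examples and extracts a code from a model reply. It is not the authors' implementation: the example snippets, the prompt wording, and the names `EXAMPLES`, `build_prompt`, and `parse_icd10` are assumptions made for illustration only.

```python
import re
from typing import Optional

# Hypothetical few-shot examples, invented for this sketch
# (not taken from the paper's dataset).
EXAMPLES = [
    ("Diagnose: Prostatakarzinom", "C61"),
    ("Z.n. Harnblasenkarzinom", "C67.9"),
    ("Nierenzellkarzinom links", "C64"),
]

# One letter, two digits, optional decimal part (e.g. C61 or C67.9).
ICD10_PATTERN = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,2})?\b")


def build_prompt(snippet: str, n_shots: int) -> str:
    """Build a prompt with the first n_shots examples (0 = zero-shot)."""
    parts = ["Assign the ICD-10 code for the tumor diagnosis in the text."]
    for text, code in EXAMPLES[:n_shots]:
        parts.append(f"Text: {text}\nICD-10: {code}")
    # The final block is left open for the model to complete.
    parts.append(f"Text: {snippet}\nICD-10:")
    return "\n\n".join(parts)


def parse_icd10(response: str) -> Optional[str]:
    """Return the first ICD-10-shaped code in a model reply, if any."""
    match = ICD10_PATTERN.search(response)
    return match.group(0) if match else None
```

Varying `n_shots` corresponds to varying the number of examples in the prompt; the reply of whichever model is used would then be passed to `parse_icd10` and compared against the annotated code.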

Code & Models

Repositories

stefan-m-lenz/urollmeval
Official

Datasets

stefan-m-lenz/UroLlmEvalSet
dataset · 46 dl

Taxonomy

Methods: LLaMA