# Clustering Breast Cancer Patients Based on Their Treatment Courses Using German Cancer Registry Data

**Authors:** Kolja Blohm, David Korfkamp, Florian Oesterling, Klaas Dählmann, Stefanie Schulze, Andreas Hein

PMC · DOI: 10.1055/a-2753-9631 · Methods of Information in Medicine · 2025-12-10

## TL;DR

This study uses machine learning to group breast cancer patients based on their treatment histories, revealing patterns and survival differences that could improve cancer registry data quality and analysis.

## Contribution

A novel similarity measure adapted from Levenshtein distance is introduced for clustering breast cancer treatment courses in cancer registries.

## Key findings

- Clustering revealed clinically plausible groups and unexpected treatment patterns in breast cancer patients.
- Survival analysis showed that outcomes differed between clusters in both prognostically favorable and unfavorable subgroups.
- The method can identify data inconsistencies and support hypothesis generation for quality monitoring in cancer registries.

## Abstract

Cancer registries collect extensive data on cancer patients, including diagnoses, treatments, and disease progression. These data offer valuable insights into cancer care, but they are challenging to analyze because of their complexity. Machine learning techniques, particularly clustering, enable the exploration of treatment data to uncover previously unknown patterns and relationships.

This work aimed to develop a method for clustering breast cancer patients in cancer registries based on their treatment courses, and to demonstrate the usefulness of clustering for gaining insights, improving data quality, and identifying clinically relevant patterns.

We developed a similarity measure adapted from the Levenshtein distance to compare treatment courses, incorporating cancer diagnosis, surgeries, radiotherapies, and systemic therapies. The method was evaluated on 17,822 breast cancer cases diagnosed in 2019 from the cancer registry of North Rhine-Westphalia. Evaluation involved two stages. First, domain experts reviewed the clustering results to assess clinical relevance and interpretability. Second, an intercluster survival analysis was performed to identify clinically relevant differences between treatment patterns.
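To make the idea of comparing treatment courses by edit distance concrete, the following is a minimal sketch of the classic Levenshtein distance applied to sequences of treatment events. It is not the authors' adapted measure, whose details are not given in this summary view. The event codes (`OP`, `RT`, `SYS`), the function names, and the length-based normalization are all assumptions made for illustration.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic edit distance between two sequences of treatment event codes.

    Counts the minimum number of insertions, deletions, and substitutions
    needed to turn sequence `a` into sequence `b` (unit cost for each).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty sequence costs i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Hypothetical normalization: 1.0 for identical courses, 0.0 for maximal distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical event codes: OP = surgery, RT = radiotherapy, SYS = systemic therapy.
course_a = ["OP", "RT", "SYS"]   # surgery, then radiotherapy, then systemic therapy
course_b = ["SYS", "OP", "RT"]   # systemic therapy first, then surgery and radiotherapy

print(levenshtein(course_a, course_b))
print(round(similarity(course_a, course_b), 3))
```

A pairwise distance matrix built this way could be passed to any clustering algorithm that accepts precomputed distances. An adapted measure would presumably replace the unit costs with domain-specific ones, but that choice is not described here.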

Expert evaluations confirmed that clustering produced clinically plausible groups while also uncovering unexpected treatment patterns and potential data inconsistencies. The survival analysis showed differences in survival between clusters in both prognostically favorable and unfavorable subgroups. These results demonstrate that treatment-course clustering can identify patient groups with differing survival outcomes. However, registry data incompleteness and unmeasured confounders may influence these findings.
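As an illustration of how survival could be compared between clusters, the sketch below computes a Kaplan-Meier survival curve for one group of patients. This summary does not state which survival estimator the authors used, so Kaplan-Meier is an assumption, and the function name and toy follow-up data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate for one cluster.

    durations: follow-up time per patient (for example, in months).
    events:    True if death was observed, False if the patient was censored.
    Returns (time, survival probability) at each time with at least one death.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk  # conditional probability of surviving past t
            curve.append((t, surv))
        at_risk -= exits[t]  # remove deaths and censored patients after time t
    return curve


# Toy data for a single hypothetical cluster (not taken from the study).
months = [5, 8, 8, 12]
died = [True, True, False, False]
for t, s in kaplan_meier(months, died):
    print(t, round(s, 3))
```

Running the estimator once per cluster and comparing the resulting curves is one exploratory way to look for intercluster differences. As the abstract notes, registry data incompleteness and unmeasured confounders limit how far such comparisons can be interpreted.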

Clustering treatment courses in cancer registries can reveal data quality issues, distinguish groups with different prognostic profiles, and support exploratory analyses of treatment patterns. While these findings are not intended to guide clinical decision making or evaluate treatment effectiveness, they can help generate hypotheses, identify unexpected care pathways, and support quality monitoring within cancer registries. Future work should focus on improving treatment data completeness, incorporating additional clinical variables, and refining clustering methods for broader applicability.

## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Diseases:** Cancer (MESH:D009369), Breast Cancer (MESH:D001943)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12991862/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12991862/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/PMC12991862/full.md

---
Source: https://tomesphere.com/paper/PMC12991862