# Clinical text mining of the performance status and progression-free survival to facilitate data collection in cancer research: an exploratory study

**Authors:** L. Lin, M. Singer-van den Hout, L.F.A. Wessels, A.J. de Langen, J.H. Beijnen, A.D.R. Huitema

PMC · DOI: 10.1016/j.esmorw.2024.100059 · ESMO Real World Data and Digital Oncology · 2024-08-13

## TL;DR

This study explores using text mining to automatically extract performance status and progression-free survival from electronic medical records, reducing the need for manual data collection in cancer research.

## Contribution

A rule-based text mining approach was developed and validated for extracting performance status and progression-free survival from Dutch clinical text.

## Key findings

- The text mining approach achieved a 96.5% weighted F1-score for performance status extraction.
- The median progression-free survival was 8.00 months for text-mined data, close to the manually curated 7.42 months.
- The C-index of 0.916 indicates strong concordance between text-mined and manual data.
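According to the abstract, the median PFS values were estimated with the Kaplan-Meier method. The sketch below illustrates how such a median can be read off the Kaplan-Meier curve. It is written in Python for illustration (the study itself used R). The function name `km_median` and the toy data in the usage note are assumptions, not taken from the paper.

```python
def km_median(times, events):
    """Kaplan-Meier median: first event time at which S(t) drops to 0.5 or below.

    times  : follow-up time per patient (e.g. months)
    events : 1 if progression/death was observed, 0 if censored
    Returns None when the survival curve never reaches 0.5.
    Assumption: when S(t) equals 0.5 exactly, this sketch returns t itself;
    some statistical packages use a different convention for that case.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            # Product-limit update: multiply by the fraction surviving time t.
            surv *= 1 - n_events / at_risk
            if surv <= 0.5:
                return t
        # Events and censored patients both leave the risk set after t.
        at_risk -= n_removed
    return None
```

With the invented toy cohort `km_median([2, 3, 4, 5, 8, 10], [1, 0, 1, 1, 0, 1])`, the survival estimate falls from 5/6 to 0.625 to about 0.417, so the function returns `5`.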

## Abstract

Modern electronic medical records (EMRs) contain a large amount of valuable data. These data can be unlocked for research by manual data collection, which is highly labor intensive. Therefore, we explored whether automated text mining (TM) could be used to extract the performance status (PS) and progression-free survival (PFS) in a cohort of 328 non-small-cell lung cancer patients.

Unstructured Dutch text data were derived from different EMR fields containing mainly information recorded during outpatient visits. A rule-based TM approach using regular expressions, implemented in the R programming language, was used to extract PS and PFS. For PS, quantitative evaluation metrics, such as the weighted F1-score, were used to determine the accuracy of the TM-extracted data. For PFS, the median PFS, estimated using the Kaplan–Meier method, was compared between the two approaches (TM and manual curation). In addition, the C-index was determined.
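The regular expressions used in the study are not reproduced in this summary, and the authors worked in R. The following Python sketch only illustrates the general idea of rule-based PS extraction. The pattern `PS_PATTERN`, the function `extract_ps`, and the example notes are hypothetical. A real rule set would also need to handle negation, dates, and other Dutch phrasings.

```python
import re
from typing import Optional

# Hypothetical rule: a PS keyword, an optional second keyword, an optional
# separator, and a single score between 0 and 4.
PS_PATTERN = re.compile(
    r"\b(?:WHO|ECOG|performance\s+status|PS)\b"  # keyword
    r"(?:\s*(?:PS|score|status))?"               # optional second keyword
    r"\s*[:=]?\s*"                               # optional separator
    r"([0-4])\b",                                # the score itself
    re.IGNORECASE,
)


def extract_ps(note: str) -> Optional[int]:
    """Return the first performance status found in a note, or None."""
    match = PS_PATTERN.search(note)
    return int(match.group(1)) if match else None
```

For example, `extract_ps("ECOG: 2, geen klachten")` returns `2`, while a note without a recognizable PS mention returns `None`. Returning `None` rather than a default keeps "no PS found" distinguishable from a score of 0.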

A PS was obtained for 196 patients (60%) using the TM approach. In 189 (96%) patients, the TM-curated PS matched the manually curated PS. The weighted F1-score was 96.5%. The median PFS was 7.42 months for the manually curated data (n = 328) and 8.00 months for the TM-curated data (n = 301). The C-index was 0.916.
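As a reminder of what the weighted F1-score measures, the sketch below computes the per-class F1 and averages it weighted by the number of true instances of each class. It is a plain-Python illustration, not the authors' R code. The function name `weighted_f1` and the labels in the usage note are invented and are not study data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted and reference label both equal this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Weight each class by its share of the reference labels.
        score += f1 * n_true / total
    return score
```

For the toy labels `y_true = [0, 0, 1, 1, 1, 2]` and `y_pred = [0, 0, 1, 1, 2, 2]`, the per-class F1 values are 1, 0.8, and 2/3, giving a weighted F1 of 38/45, or about 0.844.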

The developed TM approach is able to extract PS and PFS from the EMR with very good performance. Therefore, this approach increases the efficiency of reliable data collection from EMRs, facilitating the use of real-world data (RWD) in clinical research.

- Manual data collection from EMRs is highly labor intensive.
- TM techniques can increase the efficiency of data collection.
- TM tools are essential to advance artificial intelligence models using RWD.

## Linked entities

- **Diseases:** non-small-cell lung cancer (MONDO:0005233)

## Full-text entities

- **Diseases:** cancer (MESH:D009369), non-small-cell lung cancer (MESH:D002289)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12836783/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12836783/full.md

## References

11 references — full list in the complete paper: https://tomesphere.com/paper/PMC12836783/full.md

---
Source: https://tomesphere.com/paper/PMC12836783