# ArcTEX—a novel clinical data enrichment pipeline to support real-world evidence oncology studies

**Authors:** Keiran Tait, Joseph Cronin, Olivia Wiper, Jamie Wallis, Jim Davies, Robert Dürichen

PMC · DOI: 10.3389/fdgth.2025.1561358 · Frontiers in Digital Health · 2025-05-09

## TL;DR

ArcTEX is a pipeline that extracts oncology features from unstructured NHS clinical notes with high accuracy, even in environments where GPU availability is limited.

## Contribution

ArcTEX introduces a high-accuracy, privacy-preserving pipeline for extracting clinical features from EHR notes in resource-limited settings.

## Key findings

- ArcTEX achieves 98.67% mean accuracy for clinical features in endometrial and breast cancer.
- The model adapts to a different oncology area (lung cancer) with as few as 50 annotated training examples, reaching a comparable mean accuracy of 95%.

## Abstract

Data stored within electronic health records (EHRs) offer a valuable source of information for real-world evidence (RWE) studies in oncology. However, many key clinical features are only available within unstructured notes. We present ArcTEX, a novel data enrichment pipeline developed to extract oncological features from NHS unstructured clinical notes with high accuracy, even in resource-constrained environments where availability of GPUs might be limited. By design, the predicted outcomes of ArcTEX are free of patient-identifiable information, making this pipeline ideally suited for use in Trust environments. We compare our pipeline to existing discriminative and generative models, demonstrating its superiority over approaches such as Llama3/3.1/3.2 and other BERT-based models, with a mean accuracy of 98.67% for several essential clinical features in endometrial and breast cancer. Additionally, we show that as few as 50 annotated training examples are needed to adapt the model to a different oncology area, such as lung cancer, with a different set of priority clinical features, achieving a comparable mean accuracy of 95%.

## Linked entities

- **Diseases:** endometrial cancer (MONDO:0002447), breast cancer (MONDO:0004989), lung cancer (MONDO:0005138)

## Full-text entities

- **Diseases:** oncological (MESH:D000072716), lung cancer (MESH:D008175), endometrial and breast cancer (MESH:C537243)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12098606/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12098606/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/PMC12098606/full.md

---
Source: https://tomesphere.com/paper/PMC12098606