# Impact of cancer outcome data source on the diagnostic accuracy of ovarian cancer prediction models: a primary care cohort study

**Authors:** Yi Ting Yu, Fiona M Walter, Kirsten D Arendse, Garth Funston

PMC · DOI: 10.1136/bmjph-2025-004229 · BMJ Public Health · 2026-03-25

## TL;DR

The measured accuracy of the Ovatools ovarian cancer prediction model varied moderately depending on which data source defined the cancer outcome; discrimination (AUC) and sensitivity were highest when cancer registry data were used.

## Contribution

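The study quantifies how the choice of outcome data source (cancer registry, primary care records or hospital records, alone or in combination) changes the measured performance and threshold accuracy of an ovarian cancer prediction model in women tested for CA125 in primary care.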

## Key findings

- Ovatools discrimination was highest when the outcome was defined using cancer registry data alone (AUC 0.924).
- Sensitivity was highest with cancer registry data (73.2%) at the 3% risk threshold.
- Combining CPRD, cancer registry and HES data gave the highest positive predictive value (19.4%) but a lower AUC (0.892) than registry-only data (0.924).

## Abstract

Electronic health records are widely used to develop diagnostic prediction models for cancer. Some studies use cancer registry (CR) data, the gold standard for cancer case recordings, whereas others rely on data from alternative healthcare sources. We aimed to evaluate the impact of using CR and non-CR data sources on the diagnostic accuracy of the Ovatools ovarian cancer (OC) risk prediction model.

Retrospective cohort study using linked Clinical Practice Research Datalink (CPRD), Hospital Episode Statistics (HES) and CR data from women tested for cancer antigen 125 (CA125) in England (1 May 2011–31 December 2017). Ovatools model performance and diagnostic accuracy were compared when different data sources were used, alone and in combination, to identify the outcome, OC diagnosis in the year after CA125 testing. Threshold accuracy was measured at the National Institute for Health and Care Excellence ≥3% risk threshold.
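To make the comparison concrete, the following is a minimal sketch of how threshold accuracy and discrimination might be recomputed for the same predicted risks under different outcome definitions. It is not the authors' code. The function names, the variable names and all data values are hypothetical toy inputs, and the rule that a woman counts as a case in the combined definition if any source records a diagnosis is an assumption.

```python
from typing import Dict, List, Sequence


def threshold_accuracy(
    risks: Sequence[float], outcomes: Sequence[int], threshold: float = 0.03
) -> Dict[str, float]:
    """Sensitivity and PPV when women with risk >= threshold are flagged."""
    tp = fp = fn = 0
    for risk, has_cancer in zip(risks, outcomes):
        flagged = risk >= threshold
        if flagged and has_cancer:
            tp += 1
        elif flagged and not has_cancer:
            fp += 1
        elif not flagged and has_cancer:
            fn += 1
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


def auc(risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """AUC as the probability that a case outranks a non-case (ties count half)."""
    cases = [r for r, y in zip(risks, outcomes) if y]
    non_cases = [r for r, y in zip(risks, outcomes) if not y]
    wins = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in non_cases
    )
    return wins / (len(cases) * len(non_cases))


# Hypothetical toy data: one predicted risk per woman, plus a 0/1 outcome
# flag from each source. These are NOT values from the study.
risks: List[float] = [0.002, 0.010, 0.020, 0.040, 0.050, 0.200, 0.450, 0.700]
outcome_by_source: Dict[str, List[int]] = {
    "CR": [0, 0, 1, 0, 0, 1, 1, 1],
    "CPRD": [0, 0, 0, 0, 0, 1, 1, 1],
    "HES": [0, 1, 0, 0, 1, 1, 1, 1],
}
# Assumed combination rule: a case if any source records a diagnosis.
outcome_by_source["CR+CPRD+HES"] = [
    int(any(flags))
    for flags in zip(
        outcome_by_source["CR"],
        outcome_by_source["CPRD"],
        outcome_by_source["HES"],
    )
]

if __name__ == "__main__":
    for source, outcomes in outcome_by_source.items():
        metrics = threshold_accuracy(risks, outcomes, threshold=0.03)
        print(
            source,
            f"sensitivity={metrics['sensitivity']:.3f}",
            f"ppv={metrics['ppv']:.3f}",
            f"auc={auc(risks, outcomes):.3f}",
        )
```

The predicted risks stay fixed across the loop; only the outcome labels change, which is why differences between rows reflect the outcome definition rather than the model.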

Among 340 769 CA125-tested women, OC incidence within 12 months was highest when using HES data (0.84%), compared with CR (0.75%) and CPRD (0.65%). Area under the curve was highest when using CR alone (0.924) and lower using CPRD (0.903) or CR+CPRD+HES (0.892). At a ≥3% risk threshold, sensitivity was highest when using CR data (73.2%) and lower using CPRD (68.8%). The positive predictive value was lowest using CPRD (13.8%) and highest using CPRD+CR+HES (19.4%).
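As a rough aid to reading the incidence figures, the sketch below converts the reported percentages into approximate case counts for the 340 769-woman cohort. The counts are back-calculated from rounded percentages, so they are approximations and not figures reported in the abstract; the variable names are illustrative.

```python
# Cohort size and 12-month OC incidence (%) as stated in the abstract.
cohort_size = 340_769
incidence_percent = {"HES": 0.84, "CR": 0.75, "CPRD": 0.65}

# Approximate case counts implied by the rounded percentages (an
# approximation, not a reported result).
approx_cases = {
    source: round(cohort_size * pct / 100)
    for source, pct in incidence_percent.items()
}

# Difference from the cancer registry count, per source.
difference_vs_cr = {
    source: count - approx_cases["CR"]
    for source, count in approx_cases.items()
}

if __name__ == "__main__":
    for source in incidence_percent:
        print(source, approx_cases[source], difference_vs_cr[source])
```

Under these approximations, a difference of less than 0.1 percentage points in incidence still corresponds to a few hundred women being classified differently between sources.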

Using an OC exemplar, we found moderate variation in model performance and threshold accuracy when different data sources were used to define cancer. To ensure cancer prediction models perform as expected in real world clinical practice, gold standard data sources, such as CR data, should be used for model development and validation.

## Linked entities

- **Proteins:** MUC16 (mucin 16, cell surface associated)
- **Diseases:** ovarian cancer (MONDO:0005140)

## Full-text entities

- **Genes:** MUC16 (mucin 16, cell surface associated) [NCBI Gene 94025] {aka CA125}
- **Diseases:** OC (MESH:D010051), cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13034229/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC13034229/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC13034229/full.md

---
Source: https://tomesphere.com/paper/PMC13034229