# A Semi-Supervised Machine Learning Approach to Detecting Recurrent Metastatic Breast Cancer Cases Using Linked Cancer Registry and Electronic Medical Record Data

**Authors:** Albee Y. Ling, Allison W. Kurian, Jennifer L. Caswell-Jin, George W. Sledge Jr., Nigam H. Shah, Suzanne R. Tamang

arXiv: 1901.05958 · 2021-07-22

## TL;DR

This study presents a semi-supervised machine learning framework that detects recurrent metastatic breast cancer cases from linked electronic medical record and cancer registry data, reaching an AUC of 0.925 without expert-labeled training examples and supporting population-based cancer research.

## Contribution

The paper introduces a semi-supervised approach that combines electronic medical record (EMR) and cancer registry data to detect metastatic breast cancer (MBC) without expert-labeled training data.

## Key findings

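- The best-performing model, which used both EMR and registry features, achieved an AUC of 0.925 (95% confidence interval: 0.880-0.969) on a gold-standard set of 146 patients.
- Sensitivity was 0.861, specificity 0.878, and overall accuracy 0.870.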
- Framework enables large-scale population research on cancer recurrence.
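
As a reminder of what the reported metrics measure, here is a minimal sketch of how sensitivity, specificity, accuracy, and AUC can be computed from a labeled evaluation set. The function names and all counts and scores are hypothetical and chosen for illustration only; they are not the paper's data, and the paper's own evaluation code is not shown in this summary.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true MBC cases detected
        "specificity": tn / (tn + fp),   # fraction of non-MBC cases correctly cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=18, fn=2, tn=27, fp=3)
example_auc = auc(scores=[0.9, 0.7, 0.4, 0.6, 0.2], labels=[1, 1, 1, 0, 0])
```

Sensitivity and specificity depend on the probability threshold used to call a case MBC, whereas AUC summarizes ranking quality across all thresholds, which is why the paper reports both.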

## Abstract

Objectives: Most cancer data sources lack information on metastatic recurrence. Electronic medical records (EMRs) and population-based cancer registries contain complementary information on cancer treatment and outcomes, yet are rarely used synergistically. To enable detection of metastatic breast cancer (MBC), we applied a semi-supervised machine learning framework to linked EMR-California Cancer Registry (CCR) data. Materials and Methods: We studied 11,459 female patients treated at Stanford Health Care who received an incident breast cancer diagnosis from 2000-2014. The dataset consisted of structured data and unstructured free-text clinical notes from the EMR, linked to the CCR, a component of the Surveillance, Epidemiology and End Results (SEER) database. We extracted information on metastatic disease from patient notes to infer a class label and then trained a regularized logistic regression model for MBC classification. We evaluated model performance on a gold standard set of 146 patients. Results: There were 495 patients with de novo stage IV MBC, 1,374 patients initially diagnosed with stage 0-III disease who had recurrent MBC, and 9,590 patients with no evidence of metastasis. The median follow-up time was 96.3 months (mean 97.8, standard deviation 46.7). The best-performing model incorporated both EMR and CCR features. The area under the receiver operating characteristic curve (AUC) was 0.925 [95% confidence interval: 0.880-0.969], with sensitivity 0.861, specificity 0.878, and overall accuracy 0.870. Discussion and Conclusion: A framework for MBC case detection combining EMR and CCR data achieved good sensitivity, specificity and discrimination without requiring expert-labeled examples. This approach enables population-based research on how patients die from cancer and may identify novel predictors of cancer recurrence.
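
The two-step method in the abstract (infer a noisy class label from note text, then fit a regularized logistic regression on patient features) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the keyword patterns, the two toy features, the choice of an L2 penalty, and the hyperparameters are all hypothetical, since the abstract specifies none of them. The sketch uses only the Python standard library; in practice a library implementation of logistic regression would be the more typical choice.

```python
import math
import re

# Hypothetical patterns; the paper's actual extraction rules are not given here.
MENTION = re.compile(r"\b(metasta\w+|stage iv)\b", re.IGNORECASE)
NEGATED = re.compile(r"\bno (evidence of )?metasta\w+", re.IGNORECASE)


def infer_label(note):
    """Noisy label: 1 if the note mentions metastatic disease without simple negation."""
    if NEGATED.search(note):
        return 0
    return 1 if MENTION.search(note) else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def train_logreg(X, y, l2=0.1, lr=0.5, epochs=500):
    """Batch gradient descent on mean log loss plus an (assumed) L2 penalty on w."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n  # the intercept is left unpenalized
    return w, b


# Toy notes and features, invented for illustration only.
notes = [
    "CT shows metastatic lesions in the liver.",
    "No evidence of metastasis on imaging.",
    "Bone scan consistent with metastases.",
    "Routine follow-up, patient doing well.",
]
# Hypothetical features: [received_palliative_therapy, registry_stage_III]
features = [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

labels = [infer_label(n) for n in notes]  # inferred labels replace expert annotation
w, b = train_logreg(features, labels)
```

The point of the design is that the classifier never sees expert-assigned labels during training; expert review is needed only for the small gold-standard set used for evaluation.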

---
Source: https://tomesphere.com/paper/1901.05958